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Chapter  1 


Introduction 

1.1  Background 

The  problems  of  Independent  Component  Analysis  (ICA)  and  Blind  Source  Sepa¬ 
ration  (BSS)  are  emerging  and  appealing  areas  of  research.  Starting  in  mid  1980’s 
in  Signal  Processing  and  Neuroscience  communities,  ICA  and  BSS  fast  became 
the  research  focus  of  many  different  research  communities.  In  their  simplest  for¬ 
mulation,  ICA  and  BSS  refer  to  the  problem  of  retrieving  n  unknown  independent 
random  sources  mixed  through  a  mixing  matrix  and  observed  by  n  sensors  without 
having  or  using  any  prior  information  about  the  source  distributions  or  the  mix¬ 
ing  matrix.  More  specifically,  let  A  be  a  non-singular  matrix  and  s(t)  a  vector  of 
random  independent  sources  and  x(t)  =  As(t)  the  observed  signal.  The  problem 
is  to  estimate  A  (the  mixing  matrix )  or  a  matrix  B  that  we  call  the  un-mixing 
matrix,  such  that  s(t)  =  Bx(t)  is  an  estimate  of  the  source  signals,  by  just  using 
the  samples  of  the  sensed  signal  x(f).  It  is  apt  here  to  mention  that  the  terms 
BSS  and  ICA  are  used  almost  interchangeably,  but  BSS  which  is  more  common 
in  array  processing  literature  usually  refers  to  the  case  that  the  linearity  and  in- 
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dependence  in  the  model  “truly”  hold,  whereas  ICA  refers  to  the  case  that  these 
two  are  just  assumed.  Without  further  description  the  beauty  and  ubiquity  of 
the  problem  is  obvious.  It  is  a  very  natural  and  immediate  problem  in  many  dif¬ 
ferent  subjects:  Bio-Signal  processing,  seismic  signal  processing,  communication 
channels,  array  sensor  processing,  financial  data  analysis,...  all  deal  with  problems 
that  can  be  formulated  in  one  or  the  other  way  as  this  problem.  Till  the  80’s  this 
problem  was  considered  without  enough  attention  paid  to  the  word  “independent” , 
wher e11  independent”  was  identified  by  or  simplified  to  “uncorrelated" .  Although 
the  independence  in  many  cases  is  an  inherent  property  related  to  the  physics  of 
the  problems,  it  was  ignored  mainly  because  of  computational  difficulties,  lack  of 
knowledge  about  Higher  Order  Statistics  (HOS)  at  least  among  engineers,  and 
the  dominance  of  Gaussian  signal  model  assumption  among  researchers.  In  the 
mid  80’s  it  was  noticed  that  by  resorting  to  higher  order  statistics  or  non-linear 
functions  in  measuring  “independence”  it  is  possible  to  achieve  results  that  are  im¬ 
possible  to  have  by  just  using  second  order  statistics  or  linear  correlation.  Needless 
to  say  ICA’s  predecessor  Principal  Component  Analysis  (PCA)  is  based  on  eigen 
decomposition  of  the  data’s  covariance  matrix. 

Since  then  a  huge  amount  of  research  has  been  dedicated  to  this  problem,  its  dif¬ 
ferent  versions  and  extensions.  It  is  easy  to  see  the  influence  of  various  fields  on 
the  ICA/BSS  literature,  just  to  mention:  Statistics,  Statistical  Signal  Processing, 
Neural  Networks,  Optimization  Theory,  Nonlinear  Dynamical  Systems,  Differen¬ 
tial  and  Information  Geometry,  Neuroscience.  As  a  matter  of  fact,  in  this  thesis 
we  pursue  a  path  that  encompasses  elements  of  Statistics,  Differential  Geometry, 
Dynamical  Systems  and  Optimization  theory. 
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1.2  Outlines  and  Contributions 


In  this  thesis  we  develop  new  methods  to  solve  a  problem  known  as  “  Matrix 
Joint  Diagonalization  (JD)”  which  has  applications  in  the  BSS/ICA  problem  in 
addition  to  other  contexts.  The  Joint  Diagonalization  problem  in  one  of  its  forms  is 
simply:  for  a  set  of  symmetric  matrices  {C'*}^1  find  a  non-singular  matrix  B  such 
that  BCiBT,s  are  “as  close  to  diagonal  as  possible”  (for  more  detail  see  Chapter 
3).  This,  for  example,  can  be  encountered  in  the  case  that  all  C/s  are  believed 
theoretically  to  be  of  the  form  (7*  =  AA*AT  for  some  common  non-singular  matrix 
A  and  diagonal  A,-,  but  because  C/s  sample  estimates  are  used  this  can  hold  only 
approximately.  Hence  the  idea  would  be  to  find  a  matrix  B  that  diagonalizes  all 
of  them  simultaneously,  as  much  as  possible.  Our  approach  to  this  problem  is  to 
introduce  suitable  cost  functions  for  the  problem  and  use  ideas  from  differential 
geometry  to  derive  gradient  based  matrix  ordinary  differential  equations  (ODEs)  on 
certain  Riemannian  manifolds  such  that  the  ODEs  solution  can  yield  (or  converge 
to)  a  joint  diagonalizer.  In  deriving  ODEs  for  optimization  purposes  we  mainly 
follow  [Brockett]  and  [Helmke] .  We  also  give  some  discretization  schemes  for  these 
(ODEs)  with  an  eye  to  keeping  the  answers  on  the  underlying  manifolds. 

In  the  context  of  ICA/BSS  problem  the  matrices  to  be  jointly  diagonalized  can 
be  “cumulant  slices”  [Cardosol,  Yeredor],  Applying  the  developed  methods  to  the 
Joint  Diagonalization  of  cumulant  slices,  we  devise  algorithms  for  the  ICA/BSS 
problem,  that  are  effective  for  ICA  in  the  presence  of  Gaussian  noise. 

In  Chapter  2,  we  introduce  and  formulate  the  ICA/BSS  problem,  and  explain 
some  of  the  basic  definitions  about  it.  We  give  the  fundamental  theoretical  results 
about  the  issue  of  identifiability  in  the  ICA/BSS  problem.  We  also  introduce 
measures  of  independence,  definitions  and  some  useful  results  about  cumulants. 
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The  group  structure  of  the  ICA/BSS  problem  is  a  very  important  concept  that  is 
introduced  there.  A  short  survey  of  ICA  problems  is  also  presented. 

In  Chapter  3,  we  elaborate  more  on  the  criteria  of  Joint  Diagonalization  of 
cumulant  matrices.  We  consider  some  of  the  commonly  used  cost  functions  and 
their  validity,  we  introduce  some  new  cost  functions,  as  well. 

In  Chapter  4,  we  provide  very  briefly  the  required  material  from  the  theory  of 
matrix  Lie  groups.  Our  treatment  will  be  very  short  and  informal. 

In  Chapter  5,  following  [Brockett]  and  [Helmke],  we  introduce  a  gradient  flow 
for  joint  diagonalization  by  orthogonal  matrices,  and  consider  their  convergence 
properties.  The  Double  Bracket  [Helmke, Brockett]  formulation  of  joint  diagonal¬ 
ization  is  introduced  there.  We  also  discretize  the  orthogonal  flow  (using  the  Euler 
and  Runge-Kutta  methods)  to  give  a  gradient  based  version  of  the  famous  JADE 
[Cardosol]  algorithm  for  the  BSS/ICA  problem.  We  also  give  some  numerical 
examples  manifesting  the  performance  of  the  developed  algorithms. 

In  Chapter  6,  the  next  contribution  of  this  thesis,  we  develop  flows  for  joint 
diagonalization  by  non- orthogonal  matrices,  and  discretize  them  (using  the  Euler 
and  Runge-Kutta  methods).  More  specifically,  we  derive  flows  on  the  manifold  of 
non-singular  matrices  GL(?r)  and  matrices  with  unity  determinant  SL(n)  for  the 
JD  problem.  We  will  introduce  methods  to  discretize  them  as  well,  so  that  the 
answer  stays  on  the  underlying  manifold  to  a  good  extent. 

In  Chapter  7,  we  suggest  a  class  of  ICA/BSS  algorithms  based  on  the  methods 
developed  in  Chapters  5  and  6.  We  introduce  algorithms  that  whiten  the  data 
(using  second  order  statistics,  i.e  correlations)  in  the  first  step  and  then  search  for 
non- orthogonal  un-mixing  matrices  via  joint  diagonalization  of  a  set  of  cumulant 
slices  of  the  whitened  data,  hence  resorting  to  the  HOS  of  the  data.  This  approach 
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is  different  from  the  existing  non-orthogonal  JD  methods  in  the  ICA/BSS  context 
and  also  different  from  methods  that  use  only  HOS.  In  fact  our  method  uses  the 
benefits  both  of  second  order  statistics  that  are  robust  and  cumulants  which  are 
blind  to  Gaussian  noise.  In  this  chapter  we  also  examine  the  actual  performance 
of  the  developed  methods  by  numerical  simulations.  Chapters  5,6  and  7  constitute 
the  main  contributions  of  this  thesis. 

In  Chapter  8,  we  give  a  summary  of  the  thesis  as  well  as  some  suggestions  for 
future  work. 

In  the  Appendix  we  have  included  some  derivations  that  are  rather  lengthy  to 
be  included  in  main  chapters  or  those  that  their  omission  do  not  harm  the  sequence 
of  the  subjects. 
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Chapter  2 


Preliminaries  about  ICA/BSS 

2.1  Problem  Formulation 

First  we  formulate  the  ICA/BSS  problem  in  a  general  fashion  and  give  different 
relevant  assumptions.  Consider 

x(t)  =  A  *  s  (t)  +  n(f)  (2.1) 

where:  t  denotes  continuous  or  discrete  time,  s  is  an  n  dimensional  random  signal, 
*  indicates  the  linear  convolution  operation,  A  =  A(t,  r)  is  an  m  x  n  convolution 
kernel  or  channel  distortion  matrix,  n (t)  is  an  n-dimensional  additive  noise,  and 
x(t)  is  an  m-dimensional  observed  signal.  For  many  problems  this  model  is  quite 
realistic.  It  should  be  noted  that  because  of  the  presence  of  the  convolution  opera¬ 
tion  in  (2.1)  the  corresponding  restoration  or  inverse  problem  is  also  referred  to  as 
Blind  Deconvolution  problem,  as  we  will  see  with  some  additional  assumptions  this 
problem  reduces  to  the  standard  BSS/ICA  problem.  We  should  also  mention  that 
in  what  is  known  as  Nonlinear  ICA,  the  assumption  of  linear  mixture  is  relaxed, 
although  that  problem  is  much  more  difficult  than  the  customary  ICA  formulation 
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presented  here  [Hyvarinen]. 


2.1.1  Some  Possible  Assumptions  about  the  Model 

Here  we  cite  some  different  possible  assumptions  about  each  of  the  introduced 
variables: 

1.  Assumptions  about  the  Source  Vector  s (i): 

51.  The  source  s  is  a  vector  of  n  independent  stochastic  processes, 

52.  s  is  a  stationary  random  process,  with  zero  mean,  with  non-singular  and 
finite  correlation, 

~S2  .  s  is  a  non- stationary  random  process, 

53.  Each  component  of  s  is  Independently  Identically  Distributed  (i.i.d)  in 
time, 

54.  In  addition  to  S1,S2,S3,  s  has  at  most  one  component  with  Gaussian 
distribution, 

55.1.  n  >  m,  i.e  more  sources  than  sensors, 

55.2.  m  =  n,  the  same  number  of  sources  and  sensors, 

55.3.  n  <  m,  i.e  more  sensors  than  sources, 

S6.  Sources  are  complex  valued. 

2.  Assumptions  about  the  Mixing  Channel  A(t,r ): 

Al.  A  =  A(t,r)  is  a  time-invariant  filter. 

A2.  Impulse  response  A  =  A{t,  r)  is  instantaneous,  i.e.  Ay(t,  r)  =  aij(t)6(t  — 
(r  +  Tjj(t))),  where  5(t)  denotes  the  Dirac  delta  function  or  unit  impulse  in 
the  discrete-time  case. 

A3.  Impulse  response  is  memoryless,  i.e.  Ty( t )  =  0  and  Aij(t,  r)  =  aij(t)6(t  — 


A4.  All  assumptions  Al-3  hold,  that  is  the  mixing  process  is  a  memoryless, 
instantaneous  and  time  independent  one,  so  the  model  (2.1)  boils  down  to: 

x(f)  =  z  +  n  =  As  (t)  +  n (t)  (2-2) 

where  we  assume  that  A  is  full  rank. 

3.  Assumptions  about  the  Noise  Vector  n(t): 

Nl.  Noise  is  Gaussian. 

~N1.  Noise  is  not  Gaussian. 

N2.  Noise  is  stationary. 

N3.  Noise  covariance  matrix  is  known. 

N4.  Noise  and  signal  vector  s  are  statistically  independent. 


Adopting  each  of  these  assumptions  has  significant  impacts  on  solvability  of  the 
problem  and  as  well  as  on  the  corresponding  algorithms.  For  example  condition 
~S2  leads  to  the  problem  of  non-stationary  BBS  [Pharnl],  which  interestingly  is 
solvable  by  resorting  to  only  second  order  statistics.  As  another  example,  under 
~N1  using  higher  order  statistics  (HOS)  is  not  helpful  because  as  well  shall  see  in 
Section  2.4  HOS  are  blind  only  to  Gaussian  noise.  By  adopting  N3  the  estimation  of 
the  correlation  matrix  of  z  will  be  “less  biased”  and  the  subsequent  ICA  algorithm 
will  be  easier,  (see  Section  7.1  for  more  details) 

2.1.2  What  is  the  ICA  Problem? 

Consider  the  model  (2.2)  with  extra  assumptions  Sl-4,  S5.3  and  Nl,2,4,  then  the 
standard  ICA  problem  can  be  stated  as  follows  (we  can  assume  complex  data  and 
channel  but  we  avoid  it).  By  just  observing  the  realizations  of  the  received  signal 


x: 

FI.  Estimate  A  the  mixing  matrix,  or 

F2.  Find  a  matrix  B  such  that  y  =  Bit  is  an  estimate  of  all  or  a  subset  of  sources, 
or 

F3.  Find  a  matrix  B  such  that  the  elements  of  y  =  Bit  are  as  independent  as 
possible. 

In  general  these  statements  are  not  equivalent.  Especially  F3  is  interesting  because 
it  relates  the  restoration  of  the  samples  of  the  data  to  the  statistical  independence 
of  the  restored  signals.  As  we  will  see  in  the  next  section  the  key  idea  in  solving 
this  problem  is  restoring  the  independence. 

2.1.3  Identifiability  Conditions 

Some  immediate  questions  are:  is  it  possible  to  restore  s  exactly?,  are  there  any 
ambiguity  in  the  restoration?,  under  what  conditions  is  the  model  identifiable?. 
To  answer  these  questions  we  simplify  the  model  another  step  and  consider  the 
noiseless  model: 

x  =  As  (2-3) 

Due  to  the  multiplicative  form  of  the  equation,  obviously  any  blind  restoration 
will  be  up  to  a  scaling  factor  unless  we  specify  something  about  the  power  of  the 
signal  or  elements  of  A.  On  the  other  hand  a  priori  we  have  no  information  about 
the  ordering  of  the  elements  in  s,  thus  again  any  blind  restoration  will  be  up  to  a 
permutation  of  the  sources,  as  well.  It  turns  out  that  under  some  extra  conditions 
these  are  the  only  two  ambiguities  or  indeterminacies.  We  state  without  proof  a 
theorem  from  Comon’s  paper  [Comon  1]  in  a  rephrased  manner.  Before  that  we 
give  this  definition: 
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Definition  2.1  Two  vectors  or  matrices  A  and  B  are  called  essentially  the  same 
or  essentially  equal  if  there  is  an  invertible  diagonal  matrix  A  and  a  permutation 
matrix  II  such  that  A  =  Alii?. 

Theorem  2.1  [  Comon  1  ]  Suppose  that  the  model  x  =  As  holds  with  conditions 
Sf  and  S5.3.  If  the  elements  of  y  =  B s  are  independent  for  some  B,  then  BA  is 
essentially  diagonal,  hence  y  and  s  are  essentially  the  same. 

So  this  theorem  states  that  in  the  case  that  s  is  a  random  vector  with  at  most  one 
Gaussian  component  with  some  other  mild  conditions  restoring  independence  and 
separating  the  sources  are  equivalent,  and  this  process  has  an  inherent  ambiguity 
which  is  about  the  scale  and  order  of  the  sources.  This  theorem  is  the  theoretical 
core  for  almost  all  ICA/BSS  algorithms.  That  is,  all  these  algorithms  introduce  a 
criterion  for  independence  and  try  to  optimize  it.  Note  that  this  theorem  relates 
the  BSS  problem  to  the  ICA  problem,  as  well.  The  condition  of  having  at  most  one 
component  with  Gaussian  distribution  is  an  important  requirement,  because  for 
Gaussian  vectors  uncorrelatedness  and  independence  are  equivalent,  and  an  uncor¬ 
related  Gaussian  vector  multiplied  by  an  orthogonal  matrix  remains  independent. 
Thus  if  there  are  more  than  one  Gaussian  components,  then  the  ambiguity  will  be 
up  to  an  orthogonal  matrix  which  is  not  acceptable  in  a  separation  context. 

Now  if  we  consider  the  noisy  case,  we  can  argue  that  in  the  best  case  we  can  es¬ 
timate  B  such  that  BA  =  All  and  even  y  =  BAs  +  Bn  can  have  independent 
components  but  y  and  s  are  not  essentially  the  same,  that  is  through  linear  trans¬ 
formations  we  can  not  restore  the  sources,  although  we  may  be  able  to  restore 
independence  or  find  A.  In  fact  resorting  to  higher  order  statistics,  in  Gaussian 
noise,  we  are  able  to  find  B  such  that  BA  =  All,  but  that  is  not  enough  to  restore 
the  sources  noiselessly. 


10 


In  the  next  section  we  introduce  measures  of  independence  and  their  finite  sample 
estimates. 


2.2  Measures  of  Independence  and  Contrast  Func¬ 
tions 

We  follow  Cardoso’s  formulation  as  in  Chapter  4  of  [Haykin].  Let  x  be  an  n- 
dimensional  random  vector  with  probability  density  function  (p.d.f.)  /x(.)  (  we 
also  write  x  ~  /x(.))  and  let  xp  be  its  independentized  version;  that  is  /xP(.)  = 
rnu /*(•)• if*  /x(0  and  s  /s(.)  then  the  Kullback-Leibler  (KL)  divergence 
between  random  vectors  s  and  x  is  defined  as  D[s  ||  x]  =  f  fs(u )  log(|£|4y)  du  and 
using  the  concavity  of  logarithm  function  it  can  be  proved  that  D[ s  ||  x]  =  0  if  and 
only  if  fs  =  /x  almost  surely  [Cover].  An  important  property  of  D[.  ||  .]  is  that 
for  any  two  n-dimensional  random  vectors  x  and  y  and  non-singular  matrix  A  we 
have: 

D[H  ||  y]  =  D[Ax  ||  Ay] 

that  is  the  KL  divergence  is  invariant  under  one-to-one  linear  transformations. 

The  mutual  information  between  the  elements  of  x  is  defined  as  /[x]  =  D[x  || 
xp|.  So  /[x]  =  0  if  and  only  if  the  elements  of  x  are  independent.  In  fact  it  can  be 
shown  that 

/[x]  =  min  D[x  ||  z\ 

SeV 

where  V  is  the  manifold  1  of  random  vectors  with  independent  components  and 

^^Here  we  use  the  term  “manifold”  in  reference  to  an  infinite  dimensional  manifold  of  proba¬ 
bility  densities  whereas  in  the  subsequent  chapters  we  use  this  term  to  refer  to  finite  dimensional 
manifolds.  For  rigorous  definition  of  the  former  the  reader  is  referred  to  [Amari  1], 
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this  minimum  is  achieved  for  z  =  xp,  i.e.  xp  can  be  considered  as  the  projection  of 
x  onto  V  but  obviously  not  in  a  usual  sense  because  KL  divergence  is  not  a  true 
metric.  So  we  can  rephrase  the  ICA  problem  in  a  compact  way  as: 

min  /[Bxl 

BeRraxmand  full  rank 

Obviously,  this  is  not  a  very  constructive  way!,  but  it  is  an  instructive  one,  that 
the  goal  is  to  minimize  the  output’s  mutual  information.  Later  on  we  will  see 
that  the  space  of  full  rank  n  x  m  matrices  is  “too  big”  for  this  search,  mainly 
because  of  the  scale-permutation  ambiguity  in  the  ICA  problem.  Cardoso  has 
shown  that  a  maximum  likelihood  approach  to  ICA  also  leads  to  the  same  criterion 
[Cardoso3].  There  are  also  other  criteria,  but  they  are  somehow  derived  from 
mutual  information  [Haykin,Hyavarinen]. 

Mutual  information  is  an  example  of  a  contrast  function,  a  function  in  terms 
of  an  unknown  parameter  whose  optimization  (in  this  case  its  minimzation)  with 
respect  to  that  parameter  gives  independence  or  solves  the  ICA  problem.  For 
our  purposes  contrast  functions  are  ICA  cost  functions.  Obviously  a  good  cost 
function  is  one  that  has  only  global  optima  and  all  the  optimizers  are  essentially 
the  same  and  give  independence.  So  a  good  contrast  function  should  be  scale  and 
permutation  invariant  as  the  mutual  information  is. 

Evidently  the  problem  with  the  mutual  information  is  that  it  is  a  mathematical 
expectation  and  it  depends  on  the  p.cl.f  of  x  which  under  the  assumption  of  ICA 
is  not  known!.  There  are  different  ways  to  ameliorate  this  problem,  among  them 
is  approximating  the  mutual  information  based  on  data  samples  and  developing 
contrast  functions  accordingly.  This  is  the  approach  we  will  follow. 
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2.3  Cumulants 


We  review  the  definition  and  properties  of  cumulants  of  random  vectors.  Intuitively 
cumulants  are  higher  order  correlations,  so  they  can  be  considered  as  measures  of 
independence.  In  some  steps  we  follow  [Porat]  in  notations.  For  an  n-dimensional 
real  random  vector  x  the  moment  generating  function  is  defined  as: 


Mx( A)  =  £!{exp(ATx)}  ,  A  E  Rn 


The  cumulant  generating  function  then  is  defined  as  CX(A)  =  logMx(A).  Let 
1  £  H,  *2>  —,ik  <  n  then  the  kth  order  cumulant  of  x  is  defined  as  a  k- way  array 
whose  element  at  the  position  (ii,i2,  is: 


Cum{xil,  ...,x4)  =  Cumx(ii, 


dkCx(X ) 
d\h...d\ ik 


In  the  case  of  real  valued  sources,  cumulants  and  moments  both  are  permutation 
invariant.  It  can  be  proved  that  cumulants  and  moments  can  be  derived  from  each 
other  and  in  fact  in  practice  cumulants  are  estimated  via  estimating  the  moments. 
For  example  the  4th  order  cumulant  for  zero  mean  random  variables,  which  we  will 
use  often,  can  be  expressed  in  terms  of  moments  as: 


CW(xi,X2,X3,x4)  =  £{xix2x3x4}  -  £{xix2}.E{x3X4}  -  F;{xiX3}^{x2x4} 

-£'{x2x3}F;{xiX4} 

In  practice  we  ought  to  estimate  the  cumulants  from  sample  data,  this  is  usually 
based  on  sample  averaging  of  expressions  such  as  above,  which  in  general  is  not 
robust  to  outliers.  It  can  be  shown  that  the  higher  the  order  of  the  estimated 
cumulant  the  larger  the  variance  or  error  of  the  estimation  would  be.  This  is 
one  of  the  main  drawbacks  of  the  cumulant  based  methods,  which  makes  them 
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vulnerable  in  small  sample  sizes.  There  are  other  approaches  and  modifications 
to  sample  averaging  method.  For  example  one  can  filter  out  outliers,  out  of  some 
distance  from  the  mean.  Methods  such  as  this  are  costly  to  apply,  and  in  fact  it 
turns  out  that  a  huge  portion  of  the  computational  cost  in  many  cumulant  based 
algorithms  is  estimating  the  cumulants  rather  than  the  computations  afterwards. 

Some  of  the  properties  of  cumulants  that  are  derived  from  the  above  definition 
are  as  follows: 

Cl.  Cumulants  are  multilinear: 

C'um(xj1, ...,  ax  +  (3y,  ...,xifc)  =  aC'um(xil, ...,  x,  ...,x4)  +  f3Cum(xh,  ...,y,  ...,x4) 
As  a  result,  for  x  =  As  we  have 

C'um(xil,...,xiJ  =  ahh---aikjkCurn(sh,---,sjk) 

ji—jk 

C2.  If  x  and  y  are  any  two  independent  vectors  then  for  cumulants  of  order  k: 

Cumx+y(ii, ...,  ik)  =  Cumx(i i, ...,  ik)  +  Cumy(ii, 

C3.  If  {xjj,  ...jXjj.}  can  be  partitioned  in  two  independent  subsets  then: 

Cum (xj15  ...,XjJ  =  0 

Remark:  If  i\  =  =  ...  =  ik  =  i  then  Cumx(i,  is  called  the  kth  order 

(auto)cumulant  of  the  ith  component  of  x,  otherwise  Cumx(i\ ,  ...,ik)  is  called  cross 
cumulant. 

Corollary:  If  the  components  of  x  are  independent  then  all  the  cross  cumulants 
of  any  order  are  zero. 

C4.  The  Gaussian  random  vector  is  the  only  vector  that  all  its  cumulants  of  order 
k  >  2  are  zero. 
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By  their  multi- linearity,  cumulants  can  be  considered  as  tensors  [McCullagh],  and 
the  ICA  problem  can  be  interpreted  as  tensor  diagonalization.  We  can  restate 
Comon’s  theorem  as: 

Corollary:  In  the  standard  ICA  formulation  (with  assumptions  in  Theorem  2.1) 
x  =  Amxns,  if  B  is  a  matrix  that  makes  the  cumulant  tensor  of  y  =  Bnxm5i 
diagonal  then  BA  =  A P,  where  A  is  a  diagonal  and  P  is  a  permutation  matrix. 
By  a  diagonal  tensor  we  mean  one  with  the  property  that  its  A---A  element  is 
nonzero  only  if  A  =  ...  =  A- 

In  many  cases  we  are  interested  in  cumulant  matrix  slices,  which  we  find  from 
the  cumulant  tensor  by  fixing  all  but  two  of  the  indices.  We  denote  a  cumulant  slice 
as:  Cumx(ii,  ...,ip-i,:,ip+i,  ...,iq-i,:,iq+i...,ik),  notice  the  sign  which  shows 
that  index  ranges  over  all  its  possible  values.  Obviously  any  cumulant  slice  of  a 
real  random  vector  is  symmetric.  Now  we  prove  a  lemma  which  is  essential  to  the 
approach  chosen  in  this  thesis. 

Notation:  denotes  the  Kronecker  delta  and  is  equal  to  1  if  A  =  ...  =  in 

and  is  zero  otherwise. 

Lemma  2.1  ;  Let  x  =  Amxns  and  s  be  an  n-dimensional  vector  with  independent 
components,  then  every  cumulant  slice(matrix)  of  x  is  of  the  form 

Cumx (A ,  * •• ,  ip— i ,  * ,  ip+i ,  iq—i,  ■,  iq+ i •  •*>  A)  AAA 

where  A  is  a  diagonal  matrix  of  the  form: 

0 

A22CAms(2, ...,  2) 

0 

0  A nnCums(n, 


A  = 


AiiCAotis(1,  ...,  1) 
0 
0 
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cmd  A jj  —  n/=i,j^p,9  aH3- 

Proof:  Using  the  second  part  of  Cl  and  the  fact  that  because  of  the  indepen¬ 
dence  of  the  components  of  s,  Cums(i\,  ..,**)  =  8ilt...,i,...,ikCums(i, ...,  i),  we 
have: 

Curn^di , ip—  i,  ip,  ip+ 1,  •••,  iq—i,  iq,  iq+i---,  ik) 
n  k  k 

Y  n  akiCums(i,  ^  aki)  Cums(i , i)  = 

z=l  Z=1  i  l=ll^p,q 

^  ^  &ipi(liqi^iiCuTfls(i:  ...i) 
i 

The  last  summation  as  zp  and  range  between  0  and  m  can  be  written  in  the 
desired  AAAT  form.  A 

Note  that  in  the  above  lemma  A  is  not  restricted  to  be  square.  It  is  interesting 
to  see  that  A  depends  on  both  source  cumulants  and  elements  of  A  for  k  >  2.  In 
the  special  case  of  2nd  order  cumulant  slice,  i.e.  covariance,  A  does  not  depend  on 
the  elements  of  A.  This  difference  is  important,  as  a  covariance  matrix  is  always 
positive  semi-definite  but  a  cumulant  slice  is  not,  necessarily.  This  lemma  implies 
that  all  the  cumulant  slices  of  x  are  diagonalizable  (in  a  congruence  manner)  by 
the  pseudo-inverse  of  A  or  in  the  case  m  =  n  by  A  .  This  observation  is  the  basis 
for  the  ICA/BSS  methods  developed  in  thesis. 

2.4  Cumulant  Based  ICA 

As  we  mentioned  before,  mutual  information  is  a  measure  of  independence  and 
cross  cumulants  show  how  much  the  variables  are  dependent  on  each  other.  So  it 
would  be  interesting  to  see  how  these  measures  are  related  to  each  other. 
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2.4.1  Gaussian  Manifold  and  the  Negentropy 

Here  we  follow  Cardoso  as  in  Chapter  4  of  [Haykin].  Let  TV"  be  the  manifold  of 
zero-mean  Gaussian  random  vectors  .  It  is  easy  to  prove  that  if  n  €  TV  minimizes 
D[x  |]  n]  for  finite  covariance  zero- mean  random  vector  x  then  n]  is  a  Gaussian 
vector  with  the  same  covariance  as  x.  We  denote  this  Gaussianized  version  of  x  as 
xn.  The  quantity  TV[x]  =  ZT[x  ||  xn]  which  measures  the  KL-distance  of  x  from  TV" 
is  called  the  Negentropy  of  x.  We  know  that  for  any  n  G  TV",  and  any  non-singular 
matrix  A,  An  G  TV",  so  using  invariance  of  the  KL  divergence  under  1-1  linear 
transformations  we  have: 


TV[x]  =  minD[x  ||  n]  =  min  D[A5t  ||  4n] 

neA f  AneAT 


min  .D  LAx  II  n]  =  TV  [Ax] 

n  eAT 


thus  TV[x]  is  invariant  under  one-to-one  linear  transformations. 


For  a  scalar  random  variable  z  with  unit  variance,  /[ z]  can  be  approximated  in 


terms  of  its  cumulants  as  [Cornonl]: 


-K3K4 


(2.4) 


where  K3  =  Cum( z,z,z)  and  K4  =  Cum( z,  z,z,z).  The  proof  of  this  is  based  on 
the  approximation  of  the  p.d.f.  of  z  around  the  pdf  of  its  gaussianized  version,  in 
an  expansion  known  as  Edgeworth  expansion  [McCullagh]. 


2.4.2  The  Negentropy  and  Mutual  Information 

It  is  easy  to  show  that  TV[x]  and  7[x]  are  related  as 

/[S]  =  JV[x]  -  jr  W[x,]  +  l  log  (nk|l)  (2.5) 

2=1  ^  ' 

where  R  is  the  covariance  matrix  of  x.  By  the  Hadamard  inequality  [Marcus]  about 
positive  definite  matrices,  for  a  positive  definite  matrix  i?,  log  (  d'et(fl)  /  —  with 
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equality  iff  R  is  diagonal.  In  order  to  minimize  /[fix]  with  respect  to  B  ,  we  note 
that  the  first  term  in  (2.5),  i.e.  N[BH]  is  independent  of  B,  so  we  need  just  to 
minimize  the  other  two  terms.  In  the  sequel  we  show  that  a  separate  minimization 
of  these  two  terms  serves  the  goal  of  ICA. 

2.4.3  Whitening  the  Data 

By  whitening  or  sphering  the  data  we  can  minimize  the  last  term  in  (2.5)  and  also 
confine  the  search  for  un- mixing  matrix  to  the  set  of  orthogonal  matrices. 

For  the  proof  of  the  next  theorem,  we  state  a  useful  lemma  first: 

Lemma  2.2  ;  Let  m  >  n,  and  Pnxm  and  Qnxm,  be  full  rank  matrices.  If  PPT  = 
QQT  then  P  and  Q  are  the  same  up  to  an  orthogonal  factor,  that  is  there  exists 
an  orthogonal  Unxn  such  that  P  =  QU  or  equivalently  Q'P  =  U,  where  Q'  is  the 
right  pseudo-inverse  ofQ. 

Proof:  Let  P  =  VfZiUf  and  Q  =  V^E^J  be  the  Economical  Singular  Value 
Decomposition  (ESVD)  of  P  and  Q,  respectively. (So  is  n  x  n,  //*  is  rn  x  n  and 
E i  are  n  x  n  and  positive  definite).  Then  from  PPT  =  QQT  we  have: 


ViE  lV?  =  V2E  ^V2t 

Hence: 

(e~V2tvle1)(E“V2tv1e1)t  =  inxn 

That  is  (E^1h2rV1Ei)  =  Znxn  for  some  orthogonal  matrix  Z.  So  we  have: 

HiE  xUl  =  V2E  2UZ(U2ZIJT) 
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U  =  U2ZU1  is  orthogonal  so  P  =  QU .  A 


Theorem  2.2  ;  Let  x  =  Amxn s  with  standard  ICA  assumptions  and  let  Rxx 

_  1 

6e  f/ie  correlation  matrix  of  x.  T/ien  t/iere  exists  a  matrix  Rxf  such  that  the 
_  1 

vector  y  =  f?xx  x  is  white  (i.e.  its  correlation  matrix  is  the  identity)  and  with  the 
additional  assumption  that  Rss  =  Inxn  we  have  y  =  Unxns  for  some  orthogonal 
matrix  U . 

Proof:  Let  f?xx  =  VmxnAnxnVT  be  the  ESVD  of  f?xx.  Note  that  by  the  as¬ 
sumptions  that  A  is  full  rank,  m  >  n  and  that  all  the  components  of  s  have 

_  1  j 

nonzero  variance,  f?xx  has  rank  n.  Let’s  define  f?xx  =  A~^VT .  Obviously 

_  1  _  T 

=  ^xx  i?xxi?xx  —  Inxn • 

Now  we  have 

tfxx  =  ARss^4t  =  ArIs(ARIs)t  =  VA^VA^f 

i  1 

So  using  the  previous  lemma  AR£S  =  V A^Unxn  for  some  orthogonal  U.  Then  we 
can  write  A~^VT AR£S  =  Rxx  AR£S  =  U  or  Rxf  A  =  URSS2  ■  Thus 

y  =  /Nx2  x  =  f?xx2  As  =  U  Rss2  s 

_  1 

But,  f?ss2  is  diagonal,  and  due  to  the  scale  indeterminacy  in  the  ICA  problem  we 
can  assume  that  it  is  part  of  the  source  scaling.  So  we  may  write,  y  =  Us.  A 
Note  that  the  assumption  Rss  =  Inxn  is  not  restrictive  due  to  scale  ambiguity  in 
the  ICA  problem.  As  a  matter  of  fact  it  is  more  a  convention  than  an  assumption. 
Reviewing  the  above  proof  reveals  that  we  can  give  an  immediate  generalization 
of  this  theorem  to  “ sphering  in  cumulant  slices ” ,  as  follows: 

Theorem  2.3  (Generalized  sphering):  Let  x  =  Amxns  with  standard  ICA  as¬ 
sumptions  and  let  C  be  any  cumulant  slice  matrix  that  is  positive  semi- definite. 
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Then  there  exists  a  matrix  C  2  such  that  for  the  vector  y  =  C  2  x  we  can  assume 
y  =  Unxn s  /or  some  orthogonal  matrix  U . 

Proof:  Proof  is  essentially  the  same  as  the  previous  one. A 


2.4.4  Whitening  and  Independence 


An  interesting  question  is:  do  we  reduce  the  mutual  information  by  decorellating 
the  data?.  The  following  example  shows  that  it  is  not  the  case  in  general: 
Examplel:  Let’s  consider  two  sources  xx  and  x2  each  uniformly  distributed 
in(— |).  Because  the  mutual  information  is  invariant  with  respect  to  multi¬ 
plication  by  diagonal  and  permutation  matrices  we  consider  two  different  types  of 

cot(0)  1 


mixing  matrices:  one  of  the  form  Ai(0)  = 


1 


cot(0) 


and  the  other  of  the 


form  A2(8) 


cot (9)  —  1 


where  6  €  (0,  |).  Obviously  A2  is  equivalent(as 


|  1  cot(0)  J 

far  as  the  mutual  information  is  concerned)  to  a  rotation  by  6.  We  consider  mixing 
the  sources  under  both  of  these  matrices  and  evaluate  the  mutual  information  of 
the  mixture.  Figure  (2.1)  shows  the  mutual  information  in  terms  of  0  under  the 
two  mixers.  The  graph  shows  that  mixing  under  A2  has  a  supremum  mutual  in¬ 
formation  of  about  .3  for  6  =  On  the  other  hand  there  is  a  9C  below  which  the 
mixture  under  Ai  has  mutual  information  less  than  0.3.  So  if  the  sources  are  mixed 
with  Ai  with  0  <  6C  which  corresponds  to  the  case  that  ”  the  mixing  matrix  is  more 
diagonal  than  certain  amount”  then  by  whitening  the  data  we  may  increase  the 
mutual  information.  Certainly  there  is  a  whitening  matrix  that  makes  the  data 
independent  but  there  is  also  a  whitening  matrix  that  increases  the  mutual  in¬ 
formation  in  this  case.  Beyond  9C  all  the  whitening  matrices  reduce  the  mutual 
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Figure  2.1:  Mutual  information  of  mixtures  A,(0)x  in  terms  of  9,  in  Example  1 

information.  It  should  be  noted  that  the  interesting  fact  that  there  is  a  maximum 
achievable  mutual  information  using  orthogonal  mixing  matrix  is  an  immediate 
consequence  of  the  compactness  of  the  group  of  orthogonal  matrices.  A 

The  conclusion  of  the  example  is  that  in  the  case  where  the  mixture  is  such 
that  the  sources  have  almost  equal  power  contribution  in  the  received  signals  then 
by  whitening  the  data,  the  mutual  information  is  reduced.  In  this  case  and  if  the 
number  of  sources  is  large,  we  can  argue  that  the  received  signal  x,  due  to  Central 
Limit  Theorem  [Papoulis]  is  almost  Gaussian.  Thus,  the  second  term  in  (2.5)  is 
already  zero  and  by  making  the  last  term  equal  to  zero  /[Bx]  will  reduce.  On  the 
other  hand  if  one  source  is  dominant  in  each  sensed  signal,  i.e.  the  mixture  is  not 
fully  mixed,  by  whitening  we  may  mix  more  and  increase  the  mutual  information. 
Thus  in  the  generic  case,  i.e.  in  the  case  that  signals  are  mixed  enough  and  have 
almost  equal  power  contribution  in  the  received  signals,  whitening  or  PCA  can  be 
considered  as  a  first  step  in  the  ICA  in  order  to  reduce  dependence.  Moreover 
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sphering  has  other  benefits  that  we  will  consider  in  Section  7.1. 

2.4.5  A  Contrast  for  White  Signals 

Using  (2.4)  we  can  have  an  approximation  of  the  mutual  information  in  terms  of 
cumulants.  For  the  whitened  signal  y  =  Us  where  U  is  orthogonal  and  Rss  =  Inxn > 
by  (2.4)  and  (2.5)  we  can  have  this  approximation  of  the  mutual  information  in 
terms  of  the  cumulants: 

i[ y]  -  Nlug\  -  jg  X^(44«  +  4»  +  74*  -  64<«i«) 

where  /%*  =  Cumy(i,  i,  i )  and  =  Cumy(i ,  i,  i,  i).  Each  of  these  cumulants  using 
Cl  can  be  written  as  Km  =  Y^=i  ufjCums(i,  i,  i )  and  Kmi  =  YTj=\  ufjCums(i,  i,  i,  i ) 
where  Uij  is  the  ijth  element  of  the  orthogonal  matrix  U.  Afterwards,  considering 
the  fact  that  AT[C/s]  is  independent  of  U  minimizing  the  mutual  information  is  equal 
to  maximization  of  'F(U)  =  X4=i(44j  +  4m  +  7 4»  —  6 K^Km)  with  respect  to  the 
elements  of  the  orthogonal  matrix  U .  Whether  'F(U)  is  such  that  its  minimization 
results  in  independence  is  not  known  [Conron  1],  however  a  simplification  of  'F(tZ) 
is  proved  to  be  a  contrast  [De  Lathauwer2]: 

Theorem  2.4  :  Let  x  =  Unxns,  and  let  U  be  doubly  stochastic,  i.e.  the  2-norm 
of  each  row  or  column  of  U  is  one  and  also  assume  that  s  has  finite  cumulants 
up  to  order  r  and  has  at  most  one  zero  cumulant  of  order  r.  Let  Krj.  be  the  rth 
order  cumulant  of  xi}  and  Ti ,r(U)  =  Y^i=i  lKo*l  ant ^  ^2 ,r{U)  =  Ym=i  4i-  4/  either 
°fYi,r(U)  or  T2jr(/7)  is  maximized  then  U  is  a  permutation,  that  is  T1>r  and  T2,r 
are  contrast  functions  over  the  set  of  doubly  stochastic  matrices  for  random  vectors 
satisfying  above  conditions. 
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Again  we  note  that  the  set  of  doubly  stochastic  matrices  is  a  compact  set  and 
hence  the  cost  function  defined  has  global  minima  on  it.  Because  of  the  fact  that 
any  orthogonal  matrix  is  also  a  doubly  stochastic  matrix  we  have  this  corollary: 
Corollary:  With  the  above  conditions,  if  U  is  orthogonal  then  Ti>r  and  T2,r  are 
contrast  functions. 

Note  that  without  some  sort  of  compactification,  optimizing  functions  of  cu- 
mulants  is  meaningless,  because  cumulants  are  not  scale  invariant.  For  this  reason, 
most  of  the  contrast  function  introduced  in  the  literature  are  over  the  set  of  or¬ 
thogonal  matrices  (see  for  example  [Cardosol],  [Comonl],  [Moreau]). 

It  is  interesting  to  note  that  maximizing  T2:r(C)  for  orthogonal  U  is  equivalent  to 
minimizing  the  sum  of  the  square  of  the  cross  cumulants,  which  is  sensible  because 
cross  cumulants  measure  dependence  among  variables. 

There  are  other  contrast  functions,  among  them  we  will  consider  the  JADE  [Car¬ 
dosol]  contrast  function,  in  the  next  chapter. 

2.5  The  Group  Structure  of  the  ICA  Problem 

An  important  property  of  the  standard  ICA  problem  in  the  case  m  =  n  is  its  group 
or  multiplicative  structure  [Haykin],  [Cardoso  2],  By  the  group  structure  we  mean 
that  starting  from  x  =  As,  A  non-singular,  if  for  a  nonsingular  matrix  K  we  form 
y  =  A'x  =  KAs  =  Cs  then  the  form  of  the  problem  has  not  changed  and  we 
have  not  missed  or  gained  any  information.  Obviously  if  A  and  K  are  orthogonal 
then  C  also  is  orthogonal.  In  fact  many  algorithms  use  this  fact.  For  example  in 
[Cardosol]  and  many  other  papers,  an  iterative  optimization  method  is  used  that 
updates  the  orthogonal  un-mixing  matrix  L4  as  Uk+\  =  H^Uu  where  H is  a  proper 
Jacobi  rotation  matrix.  So  this  way  we  flow  over  the  group  of  orthogonal  matrices 
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at  each  step  as  the  cost  function  is  reduced.  In  Chapter  4  we  will  introduce  the 
concept  of  gradient  flow  over  groups,  which  is  useful  in  contrast  optimization. 

The  group  structure  has  another  implication  that  is  called  equivariance  or  uni¬ 
form  performance  property,  which  enables  us  to  devise  estimators  whose  finite 
sample  performance  in  the  noiseless  case  is  independent  of  the  mixing  matrix,  that 
is  they  will  give  the  same  answer  for  different  mixing  matrices.  Specifically,  let 
Xt  be  a  matrix  of  T  realizations  of  s  then  the  estimator  A(St)  is  equivariance  if 
A(ASt)  =  AA(St)-  Then  for  the  ICA  problem  x  =  As,  the  estimated  sources  will 
be: 

sT  =  A~1(ASt)  ASt  =  A~1(St)A~1ASt  =  A~1(St)St 

which  depends  only  on  the  sources  and  not  the  mixing  matrix.  The  key  point  here 
is  that  if  we  have  contrast  functions  that  depend  only  on  the  output  vector  y  =  Hx 
then  this  property  is  achieved.  Because  for  two  mixing  matrices  A1  and  A2  and 
mixed  signals  x)  =  AiS  and  X2  =  M2s  if  B i  and  B2  minimize  a  function  of  y)  and 
yg,  respectively;  then  B\A\  =  B2A2 ,  that  is  the  restored  signal  will  be  the  same. 
Evidently,  this  is  due  to  the  group  (multiplicative)  structure  of  the  problem. 

2.6  Measures  of  Performance 

The  performance  of  different  BSS/ICA  algorithms  can  be  compared  based  on  their 
separation  ability,  the  computational  cost  or  the  conditions  under  which  they  per¬ 
form  well.  However  in  this  section  by  “Performance  Measure”  we  mean  separa¬ 
tion  performance,  that  is  how  well  the  sources  are  separated  by  an  algorithm.  In 
practice  there  are  different  performance  measures  introduced  (see  Chapter  5  in 
[Haykin]).  Some  of  them  try  to  measure  the  interference  and  noise  at  the  output 
of  the  separator  whereas  some  others  try  to  measure  how  far  the  estimated  un- 
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mixing  matrix  is  from  a  true  one.  In  the  case  of  noisy  ICA  these  two  methods 
are  not  equivalent,  because  asymptotically  we  might  be  able  to  estimate  a  true 
un-mixing  matrix  (using  only  HOS)  but  by  applying  that  un-mixing  matrix  the 
output  noise  is  not  cancelled,  that  is  the  interference  can  be  nulled,  but  the  noise 
can  not.  In  this  thesis  we  follow  the  second  path,  because  it  is  straightforward 
to  compute  therefore  will  compare  algorithms  on  their  performance  in  finding  the 
un-mixing  matrix  in  noisy  ICA.  In  fact  with  the  usual  assumptions  in  the  standard 
ICA  model  we  can  not  do  so  much  about  noise  other  than  cancelling  its  effect  on 
estimating  the  un-mixing  matrix. 

Let  B  be  the  estimated  un-mixing  matrix  and  let  P  =  BA.  If  P  =  IIA  where 
A  is  a  non-singular  diagonal  matrix  and  II  is  a  permutation  matrix,  then  the  un¬ 
mixing  matrix  is  estimated  perfectly.  We  can  measure  the  distance  of  P  from 
permuted  diagonal  matrices,  for  example  as  [Hyvarinen]: 


Index(P)  = 


\Pi; 


Tf  j-f  maxfe  \pik\ 


I  Vi. 


maxfc  \pkj\ 


1)  (2.6) 


Obviously  the  smaller  Inclex(P)  the  better  the  performance  is  and  in  fact  it  is  zero 
if  and  only  if  P  is  a  permuted  diagonal  matrix.  In  performing  simulations  we  can 
average  Index(P)  for  many  different  realizations  and  different  sources  to  get  an 
over  all  performance  assessment  of  an  ICA  algorithm. 


2.7  A  Survey  of  ICA  Algorithms 

There  are  many  different  classes  of  algorithms  and  schools  of  thought  in  solving 
ICA/BSS  problem.  One  categorization  is  to  divide  algorithms  in  two  classes:  online 
and  off-line  or  batch  algorithms.  Online  algorithms  refer  to  the  class  of  algorithms 
that  process  the  data  on  a  sample-by-sample  basis,  whereas  batch  or  off-line  meth- 
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ocls  refer  to  those  algorithms  that  process  blocks  of  data.  In  the  ICA  context  online 
algorithms  usually  implement  a  stochastic  gradient  or  another  form  of  stochastic 
optimization  methods.  The  Amari’s  Natural  Gradient  (see  Chapter  2  in  [Haykin]) 
or  its  equivalent  Cardoso’s  Relative  Gradient  [Cardoso2]  is  an  online  algorithm 
that  uses  the  stochastic  gradient  over  group  of  nonsingular  matrices,  i.e.  the  itera¬ 
tive  algorithm  updates  the  unknown  matrix  by  multiplicative  updates.  FASTICA 
is  an  online  algorithm  that  uses  a  Newton  optimization  method  [Hyavarinen] .  The 
category  of  off-line  methods  are  usually  cumulant  based,  whereas  the  online  algo¬ 
rithms  in  addition  to  the  cumulants  use  some  other  nonlinear  functions  of  the  data. 
Comon’s  ICA  [Cornonl]  and  Cardoso’s  JADE  [Cardosol]  are  among  the  most  well 
known  batch  algorithms.  There  are  also  algorithms  based  on  Joint  Diagonaliza- 
tion  of  cumulant  slices  and  tensor.  In  [Yeroder]  and  [Ziehe]  ICA  by  means  of  joint 
diagonalization  of  cumulant  slices  by  non-orthogonal  matrices  is  addressed  and  in 
this  thesis  we  give  alternatives  for  their  methods.  In  [De  Lathauwerl],  methods 
based  on  the  idea  of  tensor  diagonalization  for  cumulant  tensors  are  developed  . 

On  the  other  hand  there  are  algorithms  that  deal  with  extensions  of  the  ba¬ 
sic  ICA/BSS  model.  For  instance  Pham  and  Cardoso  and  developed  algorithms 
based  on  joint  diagonalization  of  a  set  of  correlation  matrices  for  BSS/ICA  of 
non-stationary  sensors  [Phaml],  There  are  methods  to  deal  with  the  case  of  more 
sensors  than  sources  (see  Chapter  16  in  [Hyavarinen])  .  There  are  schemes  to  con¬ 
sider  sources  with  discrete  values,  i.e.  sources  that  have  values  in  a  finite  set,  as 
in  [Grellier]. 
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Chapter  3 


The  Joint  Diagonalization 
Criterion 


In  this  chapter  we  introduce  the  Joint  Diagonalization  (JD)  of  cumulant  slices 
criterion  and  its  application  in  the  ICA/BSS  problem.  We  introduce  the  JADE 
algorithm  which  is  a  JD  algorithm  seeking  orthogonal  un-mixing  matrix.  We  then 
extend  the  cost  function  of  JADE  to  be  able  to  search  for  non-orthogonal  un-mixing 
matrices  which  is  more  desirable  in  the  noisy  ICA. 

3.1  Joint  Diagonalization  of  Cumulant  Slices  of 
White  Signals 

As  it  was  mentioned  in  Chapter  2,  in  noiseless  ICA  it  is  possible  to  reduce  the 
search  space  for  the  un-mixing  matrix  to  the  set  of  orthogonal  matrices  denoted 
by  O(n),  by  whitening  the  data.  The  main  advantage  of  this  is  that  O(n)  is  a 
compact  multiplicative  group  (in  fact  it  is  a  compact  Lie  group)  [Helmke] .  Because 
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of  its  compactness  we  can  define  cost  functions  that  achieve  their  optima  on  O(n). 
Consider  x  =  Amxn s,  with  the  assumptions  m  >  n,  A  full  rank  and  s  with  indepen¬ 
dent  components  having  at  most  one  component  with  zero  fourth  order  cumulant 
(i.e.  assumptions  S4,  S5.3  from  Section  2.1.1).  After  whitening  x,  by  Theorem  2.2 
we  will  have: 

y(t)  =  Us  Unxn  €  O (n)  (3.1) 

Using  Lenuna  2.1,  for  fourth  order  cumulant  slices  of  the  form  Cumx(:,  '-,i>j)  we 
have: 

Cumy(:,:,i,j)  =  UAijUT  (3.2) 

where 

UnUjiCums{  1,...,1)  0 

0  ui2Uj2Cums(2, ...,  2) 

o  0 

0  uinUjnCums(n^  ...,n) 

So  remembering  the  fact  that  Cumy(:,:,i,j)  is  a  symmetric  matrix,  we  can  say 
that  (3.2)  is  an  Eigen  Decomposition  of  the  symmetric  matrix  Curny ( : ,  :,i,  j).  On 
the  other  hand  if  the  eigenvalues  of  Cumy(:, : ,i,j )  are  distinct  we  know  that  the 
diagonalizer  U  is  unique  up  to  a  column  permutation.  We  notice  that  with  two 
conditions: 

•  U  is  a  generic  (random)  orthogonal  matrix 

•  At  most  one  of  the  Cums(i,i,i,i )  is  zero  (i.e.  zero  fourth  order  cumulant) 
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the  eigenvalues  of  Cumy(:,:,i,j )  are  distinct  and  hence  any  diagonalizer  of  it  is 
essentially  U.  So  by  finding  the  eigen  decomposition  of  any  of  these  cumulant  slices 
we  can  find  U.  However,  in  practice  we  estimate  the  cumulants  by  their  sample 
averages,  so  we  should  not  expect  that  two  different  cumulant  slices  have  the  same 
diagonalizer,  especially  because  the  estimation  of  cumulants  is  very  non- robust. 
Hence,  we  may  jointly  diagonalize  all  or  a  subset  of  the  cumulant  matrix  slices. 

In  [Cardosol],  a  cost  function  for  JD  of  a  set  of  cumulant  slices  is  introduced. 
Let  { Ci }^=1  be  a  subset  of  the  set  of  4th  order  cumulant  slices  of  y.  Then  the  prob¬ 
lem  of  finding  a  joint  diagonalizer  for  {C*}^  can  be  considered  as  a  minimization 
problem: 

N 

min  .7,(0)  =  V  ||0a0T  -  diag(0Q0T)||l  (3.3) 

eeo(n)  ^  11  nF 

i= 1 

where  diag(X)  is  a  diagonal  matrix  whose  diagonal  is  identical  to  X’s  diagonal 
and  ||.||F  is  the  matrix  Frobenius  norm  operator.  In  fact  this  is  a  constrained 
optimization  problem  over  the  group  O(n).  In  different  occasions  we  will  see  how 
the  group  structure  of  the  constraint  set  can  be  exploited. 

Remark:  As  it  is  clear,  in  general  there  exists  no  exact  diagonalizer,  so  a  more 
accurate  title  for  the  subject  is  “ Joint  Approximate  Diagonalization  (JAD)”  as  it 
was  and  still  is  used  in  the  literature.  However  by  “Joint  Diagonalization”  we  also 
imply  the  same  concept  so  we  will  mainly  use  the  latter. 

3.2  The  JADE  Algorithm 

The  Joint  Approximate  Diagonalization  of  Eigen  matrices  or  JADE  is  one  of  the 
earliest  ICA  algorithms  to  use  the  idea  of  joint  diagonalization  of  cumulants  slices. 
The  algorithm  can  be  divided  in  two  separate  parts:  first,  finding  a  set  of  symmetric 
matrices  to  be  diagonalized  and  second  a  joint  diagonalization  algorithm  based  on 
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Jacobi  rotations.  We  will  first  describe  the  joint  diagonalization  algorithm,  because 
it  turns  out  that  this  is  the  most  applicable  and  versatile  part  of  the  algorithm. 
It  should  be  noted  that  in  the  literature  this  joint  diagonalization  algorithm,  itself 
is  called  the  JADE  algorithm,  so  with  some  abuse  of  terminology,  later  on  we  will 
refer  to  this  algorithm  as  JADE. 

3.2.1  Orthogonal  Joint  Diagonalization  Jacobi  Rotations 

(the  so-called  JADE  Algorithm) 

The  method  of  Jacobi  or  Givens  rotations  in  the  context  of  finding  the  eigen 
values  of  a  symmetric  matrix  is  a  well  established  method  [Golub].  An  extension 
of  this  method  can  be  used  to  find  an  approximate  orthogonal  joint  diagonalizer 
for  a  set  of  symmetric  matrices.  The  idea  is  to  compute  0  by  orthogonal 

multiplicative  updates  such  that  at  each  step  the  cost  function  J\  is  reduced.  The 
orthogonal  multiplicative  updates  keep  the  iterate  always  orthogonal.  These  are 
matrices  known  as  Givens  or  Jacobi  rotations  which  have  this  simple  form: 


where  Oik  is  an  angle  that  at  each  step  is  computed  in  order  to  minimize  the  cost 
function.  In  [Cardosol]  the  algorithm  is  proposed  for  complex  and  not  necessarily 
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symmetric  matrices,  here  we  give  a  version  for  symmetric  and  real  matrices.  The 
algorithm  is  coded  in  Table  (3.1). 


Algorithm  3.1: 

1.  Consider  the  set  of  n  x  n  symmetric  matrices  which  for  notational 

convenience  we  denote  as  ,  let  0  =  Jnxn. 


2.  For  1  <  k  <  l  < 

n  do: 

Form  the  N  x  2  matrix: 

ckk  cll 

9  r1 
ACkl 

Gkl  = 

r2  -  r2 
Lkk  cll 

2  r2 
ACkl 

rN  „N 

Lkk  HI 

- 1 

o 

CN 

Compute  V2xi  a  unit-norm  eigen  vector  of  G^Gki  corresponding  to  the 

larger  eigen  value  and  find  9ki  from  v  =  [cos  2 6^  sin2#fc;]r 

Form  Qki  from  (3.4)  and  update  0  < —  0/tz0  and  Cl  < —  @kiCzQ^. 

3.  If  ’’required”  goto  step  2  else  stop. 

Table  3.1:  The  so-called  JADE  algorithm  for  joint  diagonalization  of  the  set  { C"  } 1  by  an  orthogonal  matrix. 
The  unspecified  parameters  and  qualities  are  to  be  decided  in  practice. 


Each  run  of  the  step  2  is  called  a  sweep.  The  appearance  of  an  eigen  vector 
of  Gki  is  a  result  of  a  simple  constrained  quadratic  minimization  of  Ji{Qki)  which 
is  a  function  of  only  0ki  .  We  note  that  in  the  light  of  the  group  structure  of 
the  problem,  a  sequential  optimization  is  possible  and  the  orthogonality  of  0  is 
guaranteed  by  construction. 
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3.2.2  The  JADE  Algorithm 


Now  we  state  the  original  JADE  algorithm  for  the  BSS  problem.  The  steps  in  the 
original  JADE  algorithm  are  as  follows: 

_  1 

1.  Whiten  the  mixed  signal  x  to  get  y  =  /?Xx  x. 

2.  Estimate  Cumy  the  4th  order  cumulant  tensor  of  y. 

3.  Find  the  eigen  matrices  corresponding  to  the  n  largest  eigen  values 

of  Cumy. 

Remark:  Note  that  any  4th  order  n-dimensional  tensor  is  a  linear  mapping 
from  Rnxn  to  Rnxn,  i.e.  from  the  space  ofnxn  matrices  to  the  space  of  n  x  n 
matrices.  If  this  mapping  is  symmetric  or  self-adjoint,  as  a  cumulant  tensor 
is  ,  then  there  exist  n2  eigen  matrices  that  are  orthogonal  to  each  other  and 
can  represent  the  tensor.  With  the  condition  y  =  Us  and  the  independence 
of  the  components  of  s  ,  it  is  shown  in  [Cardosol]  that  only  n  eigen  values  of 
Cumy  are  nonzero,  and  also  it  is  shown  that  the  joint  diagonalization  of  n2 
matrix  slices  of  Cumy  is  equivalent  to  the  joint  diagonalization  of  {£;,;}™=1. 

4.  Use  Algorithm  3,1,  to  find  0  the  orthogonal  joint  diagonalizer  of  ,  by 

minimizing  the  cost  function: 

n 

A(0)  =  5]  || ®EiOT  -  diag(0^0T)||^ 

i=l 

In  practice  because  the  linear  model  may  not  hold  exactly  and  that  the  cumulant 
estimates  are  not  accurate,  step  3  is  not  justified  and  is  neglected  and  that  is  why  we 
did  not  delve  more  into  computing  eigen  matrices.  Hence  in  step  4  a  set  of  matrix 
slices  of  Cumy  is  used,  for  example  the  set  J\f  =  {Cumy(:,  <  i,j  <  n}. 
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Remark:  It  is  possible  to  associate  weights  to  each  term  in  the  above  cost  function. 
Hence  a  more  general  form  of  Ji  is: 

n 

Ji(0)  =  -  diag(eaeT)||^ 

i=i 

where  Wi  >  0  and  to*  =  1.  Note  that  this  is  equivalent  to  considering  the  set 
{^/wlCi}^  instead  of  {Ci=i}^=l  in  the  original  form  of  J\.  Hence  the  cost  function 
with  weights  can  be  reduced  to  one  without  weights. 


3.3  Non-Orthogonal  Joint  Diagonalization 

The  main  drawback  of  the  JADE  algorithm  and  other  ICA  algorithms  that  whiten 
the  data  and  confine  the  search  space  to  orthogonal  matrices  is  that  in  the  pres¬ 
ence  of  noise  the  whitening  process  is  biased,  due  to  the  fact  the  noise  covariance 
is  unknown,  and  this  bias  cannot  be  compensated  in  the  subsequent  orthogonal 
search.  On  the  other  hand  the  Ath  order  cumulants  are  blind  to  Gaussian  noise,  so 
a  method  that  uses  only  higher  order  cumulants  can  suffer  less  from  this  problem. 
One  approach  based  on  JD  is  to  search  for  a  non-orthogonal  joint  diagonalizer.  This 
can  be  justified  by  Lemma  2.1  which  states  that  all  the  cumulant  matrix  slices  of 
any  order  are  diagonalizable  in  a  congruence  manner  by  the  pseudo-inverse  of  the 
mixing  matrix.  To  be  able  to  have  an  algorithm  for  joint  diagonalization  of  a  set 
of  matrices  we  should  have  suitable  cost  functions  for  joint  diagonalization. 

In  developing  JD  cost  functions  for  a  set  of  symmetric  matrices  {CJ^,  we  may 
follow  two  approaches: 

1.  Find  a  matrix  B  such  that  {BCiB7^}^  are  as  diagonal  as  possible,  or 

2.  Find  W  and  diagonal  matrices  {Ai}^=1  such  that  C,,  ~  W AiWT  as  much  as 
possible  for  all  1  <  i  <  N. 
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The  first  approach  deals  with  the  un-mixing  matrix  directly  but  the  second  one 
looks  for  an  estimate  of  the  mixing  matrix.  The  second  approach  has  the  drawback 
that  because  if  C*  ~  IT A* IT  then  C*  ~  (W hr1)  AAjA  {W hr1)T ,  for  any  diagonal 
matrix  A,  it  is  possible  that  IT  is  very  bad  conditioned,  then  its  inverse  will  be 
hard  to  find. 

We  can  use  different  measures  for  approximating  the  “closeness”.  One  such 
measure  can  be  the  Frobenius  norm  of  the  error,  which  has  the  advantage  of 
analytical  convenience.  So  we  can  have  the  corresponding  cost  functions  for  the 
first  approach  as: 

N 

min  J\(B)  =  ^  || BCiBT  —  diag(.BC'i.E>T)||^  (3.5) 

i= 1 

where  B  is  a  full  row-rank  n  x  m  matrix  (we  assume  n  <  m).  Note  that  this 
is  exactly  the  cost  function  we  used  for  the  orthogonal  case,  but  now  on  a  larger 
space.  A  cost  function  corresponding  to  the  second  approach  used  in  [Yeredor]  is 

N 

min  J2(W,  {Ai}^)  =  ]T  \\Ci  -  WhtWT ||*  (3.6) 

i= 1 

where  IT  belongs  to  the  manifold  of  full  column-rank  m  x  n  matrices  and  A,-  are 
diagonal  n  x  n  matrices. 

3.3.1  Square  Mixing  Matrices 

So  far  we  have  considered  rectangular  mixing  matrices,  from  now  on  we  consider 
only  square  mixing  matrices.  In  fact  this  restriction  is  not  very  prohibitive,  because 
by  whitening  the  data  we  can  reduce  the  number  of  sensors  to  the  number  of 
sources.  On  the  other  hand  we  may  argue  that  we  could  use  the  redundancy  in 
the  data  in  cunrulant  domain  as  well.  The  main  reason  that  we  consider  only 
square  matrices  is  that  in  the  subsequent  chapters  we  intend  to  use  the  group 
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structure  of  the  BSS/ICA  problem  aud  develop  methods  in  matrix  groups,  hence 
the  assumption  of  having  an  inverse  in  the  group-theoretic  sense  is  important  and 
only  square  matrices  have  this  property.  So  from  now  on  we  assume  that  the  mixing 
matrix  is  square  (m  =  n)  unless  otherwise  stated,..  Yet  it  is  possible  to  extend  the 
idea  of  group  structure  in  a  similar  form  to  non-square  case  as  in  [Zhang], 

3.3.2  Two  Desired  Properties  for  JD  Cost  Functions 

In  the  context  of  ICA/BSS  we  expect  a  cost  function  for  JD  to  have  two  properties: 

1.  It  should  be  scale  invariant,  i.e.  it  should  not  change  by  diagonal  mixing  or 
un-mixing  matrices, 

2.  It  should  be  invariant  under  permutations,  i.e.  it  should  not  change  by 
permutation  mixing  or  un-mixing  matrices. 

These  two  properties  represent  the  inherent  indeterminacies  in  the  ICA/BSS  prob¬ 
lem,  and  resemble  the  properties  of  the  mutual  information.  We  recall  that 
/[x]  =  /[Ax]  for  any  non-singular  diagonal  matrix  A.  Obviously  J\  does  not 
have  the  first  property  but  has  the  second  one.  J2  on  the  other  hand  has  the  first 
property  in  an  extended  sense,  because  J2(ITA,  {A_1AiA“1}]/=1)  =  J2(W,  {Aj}^) 
for  any  non-singular  Wnxn,  i.e.  we  can  keep  the  cost  function  unchanged  under 
diagonal  mixing.  J2  is  obviously  unchanged  under  permutations  as  well. 

Remark:  We  can  derive  JD  cost  functions  either  in  terms  of  an  un-nrixing  matrix 
B  or  the  (estimated)  mixing  matrix  W .  If  J  is  a  scale-invariant  JD  cost  function 
then  J( A)  =  J(Inxn )•  If  J  is  in  terms  of  B  this  translates  to  J(AB)  =  J(B)  and  if 
it  is  in  terms  of  W  it  translates  to  J(W A)  =  J(W).  Therefore  scale-invariance  in 
terms  of  the  un-nrixing  matrix  refers  to  multiplication  by  a  diagonal  matrix  from 
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the  left  and  in  terms  of  the  mixing  matrix  it  refers  to  multiplication  from  the  right 
by  a  diagonal  matrix.  In  this  thesis  we  are  mainly  interested  in  cost  functions  in 
terms  of  un-mixing  matrices. 


3.3.3  Examples  of  Scale  and  Permutation  Invariant  Cost 
Functions 


In  this  section  we  introduce  some  cost  functions  for  non-orthogonal  JD  that  have 
the  scale  and  permutation  invariance  property. 

Example  1:  Ji(B)  is  not  scale  invariant  on  the  set  of  n  x  n  non-singular  matrices 
as  mentioned  above,  but  it  is  scale  invariant  if  we  restrict  B  to  belong  to  O(n).  It 
is  permutation  invariant  on  the  set  of  n  x  n  non-singular  matrices 
Example  2  [Pham2]:  The  cost  function 


V-!  ( det  diag(BCiBT)  ^ 

MB)  =  2>g(  det  BC<Br  ) 


is  not  well  defined  in  general,  but  if  {(7*= i}^=1  are  positive  definite,  then  it  is  always 
non-negative  and  reaches  zero  when  all  {C*}  is  jointly  diagonalizable.  This  cost 
function  has  the  property  that  is  scale  invariant  :  and  also  Js(AB)  =  J3(I?),  for 
any  non-singular  diagonal  A.  It  scale-invariant  too. 

Example  3:  Another  form  of  the  above  cost  function  with  Frobenius  norm  is 

N 

MB)  =  Y  ||JBC5T(diag(5C;5T))-1  -  I\\l 

i= 1 


This  cost  function  is  scale  invariant  if  CV s  are  positive  definite.  Note  that  this  is 
required  to  ensure  that  diag {BHiBT)  is  invertible.  We  also  have  MAB)  =  J4(i?) 
for  any  non-singular  diagonal  A.  It  is  also  permutation  invariant. 
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Example  4:  A  cost  function  that  somehow  normalizes  J\  is 

N 

MB)  =  Y  II Ci  -  B^1  dia,g(BCiBT) B~T (3.7) 

4=1 

This  cost  function  does  not  require  positive  definiteness  of  C*.  It  is  scale  and  per¬ 
mutation  invariant.  We  also  have  Js(AB)  =  J5(B),  for  any  non-singular  diagonal 
A.  Its  drawback  is  that  it  uses  B -1  together  with  B  which  imposes  more  compu¬ 
tational  cost  in  its  minimization. 

Remark:  Some  of  the  above  cost  functions  require  that  be  positive  definite. 

In  fact  as  we  mentioned  before  cumulant  slices  are  not  necessarily  positive  definite 
so  these  cost  functions  are  not  suitable  for  joint  diagonalization  of  cumulant  slices, 
but  they  can  be  used  to  jointly  diagonalize  a  set  of  correlation  matrices.  Joint  di¬ 
agonalization  of  correlation  matrices  is  a  useful  tool  in  separation  of  non-stationary 
sources  [Phaml]. 

3.3.4  Dealing  with  Cost  Functions  that  are  not  Scale-Invariant 

The  main  problem  with  a  joint  diagonalization  cost  function  that  like  J\  is  not 
scale  invariant  is  that  it  can  be  reduced  by  a  diagonal  matrix,  which  in  the  context 
of  BSS/ICA  does  not  result  in  independence.  In  Chapter  6  we  give  some  methods 
to  deal  with  this  issue.  The  main  idea  there  is  to  somehow  identify  A B  with  B 
for  all  non-singular  diagonal  matrices  A.  This  way  we  will  exclude  all  un-mixing 
matrices  that  are  essentially  the  same  as  B.  This  problem  can  be  related  also  to 
the  fact  that  the  set  of  non-singular  n  x  n  matrices  GL(n)  is  not  a  compact  set 
and  a  priori  we  have  no  guarantee  that  J\{B)  has  a  minimum  on  GL(n).  In  fact 
Ji(B)  has  a  global  infimum  of  zero  at  B  =  0  (which  is  not  in  GL(n))  and  we  will 
show  in  Section  6.1  that  in  the  case  that  the  set  {C(}(^1  is  not  diagonalizable,  J\ 
does  not  have  any  local  minima.  The  reason  for  this  is  simply  the  fact  that  at  any 
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point  the  cost  function  can  be  reduced  by  a  diagonal  matrix.  So  the  identification 
of  B  and  A B  can  be  considered  as  a  method  for  compactification  of  GL(n). 
Another  approach  maybe  to  add  penalty  terms  to  the  cost  function  which  accounts 
for  the  non-compactness  issue,  for  example: 

Ji(B)  =  MB)  ~  log  (I  det  -B|) 

has  the  property  that  penalizes  the  cost  when  |  det  (.E>)|  becomes  small  .  This  way 
we  can  exclude  the  global  infimum  of  J\  at  B  =  0. 
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Chapter  4 


Matrix  Lie  Groups 


In  this  chapter  we  briefly  introduce  the  necessary  tools  from  the  theory  of  differ¬ 
entiable  manifolds  and  Lie  Groups  needed  in  the  subsequent  chapters. 

4.1  Introduction 

The  theory  of  differentiable  manifolds  is  a  part  of  the  of  field  of  differential  geome¬ 
try.  Differential  geometry  is  an  important  mathematical  discipline  in  dealing  with 
nonlinear  systems.  Lie  theory  deals  with  manifolds  that  have  group  structure.  To 
define  abstract  groups  we  do  not  need  any  topological  requirement,  but  Lie  groups 
are  in  some  sense  continuous  groups,  i.e. ,  close  to  any  element  of  the  group  we 
can  have  another  element.  Exact  treatment  of  differentiable  manifolds  and  general 
Lie  groups  requires  lots  of  technicalities  and  is  quite  complicated.  In  this  chapter 
we  consider  only  matrix  Lie  groups  in  order  to  furnish  the  necessary  tools  needed 
for  our  approach  to  the  matrix  joint  diagonalization  problem.  We  also  give  some 
results  for  differential  equations  on  manifolds. 
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4.2  Riemannian  Manifolds,  Tangent  Spaces  and 


Gradients 

For  the  purposes  of  this  thesis  we  define  a  (smooth)  manifold  as  follows: 

Definition  4.1  Consider  the  set  Mn  C  .  Assume  that  there  exists  a  collection 
of  open  subsets  of  Mn  such  as  {Ua}  that  cover  M  and  each  Ua  is  homeomorphic  to 
an  open  set  Rn,'  i.e  there  exits  a  continuous  and  one-to-one  map  with  continuous 
inverse,  pa  :  Ua  — *  Rn.  Assume  further  that  if  Ua  H  Up  is  non-empty  then  the 
transition  maps  c pa0(Pp  '■  Tp{UaC\Up)  — *  pa{UaC\Up)  and  ippotp^1  :  ipa(UaC\Up)  — > 
Pp(Ua  fl  Up)  are  smooth.  The  set  Mn  together  will  all  possible  such  subsets  and 
maps  is  called  an  n-dimensional  smooth  manifold.  The  set  Ua  is  called  a  chart 
and  ip a  is  its  corresponding  (local)  coordinate  function  and  the  function  pff  is  its 
corresponding  (local)  parameterization. 

Intuitively  this  means  that  M  locally  looks  like  Rn,  yet  it  is  not  flat  in  the 
same  manner  that  R”  is.  From  this  definition  the  graph  of  any  smooth  func¬ 
tion  /  :  R"  — »  R  is  an  n  dimensional  smooth  manifold.  The  same  is  true  for  a 
sphere  in  R3,  which  is  a  2-dimensional  manifold.  Note  that  to  describe  the  points 
on  a  sphere  in  the  above  form  we  need  to  use  more  than  one  local  coordinate  sys¬ 
tem  to  describe  the  whole  sphere.  On  the  other  hand  the  graph  of  the  function  |x|, 
i.e.  the  points  (x,  |x|)  do  not  form  a  smooth  manifold.  There  are  manifolds  more 
interesting  than  the  ones  mentioned.  For  example,  the  set  of  non-singular  n  x  n 

matrices  known  as  GL(n)  is  an  n2-dimensional  manifold  in  R™2,  where  singular 

_  2 

matrices  correspond  to  “holes”  and  are  excluded  from  Rn  ,  or  the  set  of  orthog¬ 
onal  n  x  n  matrices  is  an  dimensional  manifold  in  R”2.  We  will  consider 

matrix  manifolds  in  more  detail  later. 
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Remark:  Note  that  from  the  above  definition  the  collection  of  charts  and  coordi¬ 
nate  functions  are  maximal  in  the  sense  that  a  manifold  contains  all  the  possible 
charts  that  are  compatible  with  its  structure.  Therefore  a  point  p  E  M  we  have  in 
particular  a  pair  (Up,<pp)  with  pp(p)  =  0  €  Mn. 

4.2.1  Tangent  Space 

It  should  be  noted  that  a  manifold  by  its  own  does  not  need  to  have  any  algebraic 
properties  and  in  fact  manifolds  are  not  in  general  closed  under  addition  or  multi¬ 
plication  or  other  algebraic  operations.  They  are  merely  curved  geometric  objects 
that  are  smooth  enough.  Yet  one  can  associate  to  them  some  sets  that  have  nice 
properties.  The  most  immediate  one  is  a  tangent  space  to  a  manifold  at  a  point  on 
the  manifold.  Tangent  space,  intuitively  is  nothing  but  the  linear  approximation  of 
the  manifold  at  a  point,  and  it  can  be  described  based  on  the  first  order  derivative 
of  the  local  parameterization  function: 

Definition  4.2  :  Let  Mn  C  R"+fe  be  a  smooth  manifold.  Let  ip  be  a  local  co¬ 
ordinate  function  on  a  neighborhood  around  p,  then  the  columns  of  the  Jacobian 
matrix  centered  at  p  span  a  linear  space  of  dimension  n,  which  is  called  the 
tangent  space  at  p  denoted  by  TpM  and  any  vector  in  this  space  is  called  a  tangent 
vector.  The  set  TM  =  |J  TpM  is  called  the  tangent  bundle  of  M.  A  function 

peM 

X  :  M  — >  TM  that  to  any  point  p  E  M  assigns  tangent  vector  X(p)  E  TPM  in  a 
smooth  fashion  is  called  a  smooth  vector  field. 

It  can  be  shown  that  the  above  definition  is  a  geometric  one  and  independent  of 
the  local  coordinate  function  used.  Tangent  vectors  are  identified  by  their  effect, 
which  is  to  give  the  directional  derivative  of  a  function  on  the  manifold  in  their 
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directions.  It  is  possible  to  relate  the  above  definition  to  the  concept  of  a  curve  on 
a  manifold.  A  curve  is  a  one  dimensional  object  defined  as: 

Definition  4.3  :  A  curve  through  a  point  p  €  M  is  a  smooth  map  7  :  (—6,  b )  — > 
M ,  b  >  0  such  that  7(0)  =  p.  Usually  we  denote  the  independent  real  variable  in 
the  definition  as  time  t  and  the  curve  will  be  7  (t). 

Let  Up  be  a  neighborhood  around  p  with  the  coordinate  function  p.  Let  p^1  denote 
the  local  parameterization  function  where  all  but  ith  component  are  fixed  to  zero 
and  Xi  is  the  only  variable.  Then  p~l  defines  a  curve  on  M.  Assume  that  7 (t)  €  Up 
for  —  b  <  t  <  b  then  the  composition  p  07:  (—6,  b)  C  R  — >  p(Up)  C  1"  is  a  vector 
valued  function  of  the  variable  t  on  an  open  interval  around  t  =  0,  so  we  can  define 
its  time  derivative  at  t=0  in  terms  of  the  time  derivative  of  each  of  its  components 
Pi  o  7.  The  time  derivative  or  velocity  of  the  curve  7 (t)  at  t  =  0  or  at  p.  can  be 
defined  as 


d'y 

d(p  1  0  p  0  7) 

_  \  ^  dp  1 

d(pi  0  7) 

dt 

t= 0 

dt 

*  ^  (~)rr  ■ 

t= 0  i= 1 

*=¥>(7(0))  ^ 

n  ,  -1 
d<Pi 

^  dxi 


i= 1 


d(tfi  o  7) 


*=^(7(0)) 


dt 


t= 0 


Thus  the  velocity  vector  of  the  curve  7 (t)  at  t  =  0  is  a  linear  combination  of  the  the 
tangent  vectors  or  the  velocity  vectors  of  the  curves  at  t  =  0.  Therefore  it  is  a 
tangent  vector  as  well.  This  definition  is  independent  of  the  local  coordinate  used. 

a  —  ^ 

Note  that  above  is  in  fact  an  expansion  of  the  velocity  vector  in  the  basis  {  , 

so  the  vector  (0), ...,  (0)]T  is  the  representation  of  the  velocity  vector 

of  7  at  p  corresponding  to  the  local  coordinate  function  p.  We  can  also  define  the 
directional  derivative  of  a  function  /  :  Mn  — >  R  along  a  curve  7  at  7(0)  =  p  based 
on  the  chain  ride  as  : 
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Df\m  =  (7 './)  (p)  = 


d(fo'y) 


clt 


t= 0 


E 

2=1 


<9/(<A 


-i\ 


dxi 


d{ifi  o  7) 


ic=c/?(p)=0 


dt 


t= 0 


Note  that  the  directional  derivative  depends  on  how  /  changes  on  M  at  p  and  the 
velocity  of  the  curve  at  p. 


4.2.2  Riemannian  Manifolds 

We  emphasize  again  that  TpM  is  a  linear  vector  space  at  any  point.  Therefore  we 
can  equip  TpM  at  any  point  with  an  inner  product: 

Definition  4.4  A  Riemannian  metric  on  Mn  is  a  family  of  inner  products  (.,.)p 
defined  on  each  tangent  space  TPM ,  such  that  it  depends  smoothly  onp  6  M .  When 
endowed  with  a  Riemannian  metric,  M  is  called  a  Riemannian  munifold. 

We  recall  that  any  inner  product  on  TpM  has  a  positive  definite  matrix  represen¬ 
tation.  That  is  if  two  tangent  vectors  in  £,  77  G  TpM  have  representations  u  and  v 
in  Rn,  respectively,  then  (£,  ij)p  =  uTQpv  where  Qp  is  the  positive  definite  matrix 
representing  the  Riemannian  metric  at  p. 

Let  /  be  a  smooth  function  /  :  M  — >  R.  It  can  be  shown  that  there  exists  a 
unique  smooth  vector  held  V/  :  M  — >  TM,  p  1— >  V/(p)  such  that  for  a  point  p 
and  any  smooth  curve  7 (t)  with  7(0)  =  p  we  have  (7*/)  (p)  =  (V/(p),  7(0))p  where 
7(0)  =  4* (f  =  0).  This  vector  held  is  called  the  gradient  vector  held.  Note  that 
this  vector  held  depends  on  the  Riemannian  metric  used,  and  in  fact  the  above 
equation  is  the  defining  equation  for  the  gradient  vector  held.  The  result  of  this 
dehnition  is  that  if  the  function  /  has  a  local  minimum  at  po  6  M  then  V/(po)  =  0 
at  that  point.  A  point  p0  6  M  is  called  a  critical  point  of  /  if  V/(po)  =  0-  On  the 
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other  hand  we  can  define  the  Hessian  Hf(po)  at  a  critical  point  po  as  a  symmetric 
bilinear  form  such  that  for  any  curve  7 (t)  on  M  with  7(0)  =  p  we  have: 

Hf(p0)  :  TP0M  x  TP0M  ->  R 

Hf(Po)(j(0),7(0))  =  ^(/°T)(0) 

Note  that  due  to  the  symmetry  and  multilinearity  property  of  Hf(p0)  we  only  need 
to  specify  it  at  points  like  (£,£)  E  TP0M  x  TP0M.  A  critical  point  p0  for  which 
the  Hessian  is  positive  definite,  i.e.  Hf(p0)( 7(0),  7(0))  >  0  for  any  curve  on  the 
manifold  passing  through  p0  with  nonzero  velocity,  is  a  local  minimum,  of  /  on 
M.  The  significance  of  this  formulation  is  that  if  in  a  constrained  optimization 
problem  the  constraints  form  a  smooth  manifold  then  we  can  reformulate  the  opti¬ 
mality  conditions  without  constraints  and  avoid  Lagrange  multipliers  formulation. 
As  we  shall  see  later  this  enables  us  to  consider  complicated  constraints  such  as 
orthogonality  or  non-singularity  for  matrices. 

We  mention  that  the  Hessian  is  a  bilinear  form,  which  for  an  n-dimensional  mani¬ 
fold  can  have  a  symmetric  matrix  representation.  If  for  a  critical  point  po  the  corre¬ 
sponding  Hessian  matrix  is  non-singular  then  that  point  is  called  a  non- degenerate 
critical  point.  Non-degenerate  critical  points  are  always  isolated  and  there  are  at 
most  countably  many  of  themfHelmke]. 

4.3  Flows  on  Riemannian  Manifolds 

If  a  moving  object  has  a  velocity  vector  that  is  tangent  to  a  manifold  at  any 
point  then  we  expect  that  the  object  stays  on  the  manifold.  This  is  not  true 
of  course  for  all  times  because  we  can  have  the  finite  escape  time  phenomena, 
which  means  that  the  object  may  leave  the  manifold  in  finite  time.  For  example  if 
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M  =  (0, 1)  then  the  answer  to  the  linear  differential  equation  x  =  1  with  x(0)  =  \ 
leaves  M  at  t  =  An  integral  curve  of  the  smooth  vector  field  A  :  M  — ►  TM 
with  initial  condition  7(0)  =  p  is  a  differentiable  map  7  :  Ip  ^  M  such  that 
7 p(f)  =  A (7 p(t)),Vf  €E  /p where  Ip  is  an  open  interval  containing  0.  If  the  longest 
Ip  is  of  finite  length  then  we  have  the  finite  escape  time  phenomena. 

The  flow  of  A  is  the  collection  of  all  maps  <pt  :  M  — ►  M  such  that  fflp)  =  l{t) 
where  7(f)  is  the  integral  curve  with  the  initial  condition  7(0)  =  p.  Existence  and 
uniqueness  theorems  guarantee  that  4>t  (p)  is  a  diffeomorphism  on-to  its  image  with 
inverse  (<0)-1  =  4>-t  [Helmke] , [Marsden] . 

The  ideas  of  stability  for  ordinary  differential  equations(ODEs)  on  manifolds  are 
similar  to  those  for  ODEs  on  Rn  .  Here  we  state  a  version  of  La  Salle ’s  invariance 
principle  [Khalil]  for  flows  on  manifolds  [Helmke]: 


Theorem  4.1  La  Salle’s  Invariance  principle  Let  X  :  M  —>  TM  be  a  smooth 
vector  field  on  a  Riemannian  manifold  and  let  f  :  M  — >  R  be  a  smooth  function 
such  that  it  has  compact  sub-level  sets,  i.e.  the  set  {p  €  M\V(p)  <  c  Vc  €  R}  is 
compact  and  /(7(0)  =  ®  for  anV  solution  7 (t)  of  7(f)  =  X(^(t)) (*). 

Then  every  solution  of  (*)  stays  on  M  for  all  t  >  0.  Moreover  any  solution 
approaches  the  largest  compact,  connected  and  invariant  subset  of  Q  =  {p  6 
M\(V f,  X(p))p  =  0 }.(f  is  called  a  weak  Lyapunov  function  for  (*)). 

An  important  class  of  flows  on  Riemannian  manifolds  is  the  class  of  gradient 
flows.  If  the  smooth  function  f  :  M  — »  R  has  the  gradient  vector  field  (with 
respect  to  a  certain  Riemannian  metric)  V/  then  the  differential  equation  x  = 
— V/(x)(**)  is  a  flow  in  the  negative  direction  of  the  gradient  in  order  to  minimize 
/.  Obviously  the  critical  points  of  /  are  equilibria  of  the  flow.  In  fact  we  can  see 
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that  f(x(t))  <  f(x( 0))  for  all  t  >  0  that  x(t)  is  defined  on  M.  So  /  is  a  Lyapunov 
function  for  this  flow.  It  can  also  been  shown  that  if  for  a  critical  point  the  Hessian 
is  positive  definite  then  the  point  is  asymptotically  stable  and  if  the  Hessian  has 
at  least  one  negative  eigenvalue  then  that  point  is  unstable.  Using  the  La  Salle’s 
theorem,  we  can  show  that  if  /  has  compact  sub-level  sets  or  M  is  compact  and  if 
/  has  only  finite  number  of  critical  points  then  any  solution  of  (**)  converges  to 
a  single  critical  point  of  /  (point-convergence  as  opposed  to  the  set-convergence 
stipulated  in  the  La  Salle’s  theorem)  and  for  typical  initial  conditions  it  converges 
to  a  local  minimum.  Nevertheless  not  all  functions  have  compact  sub-level  sets  or 
finite  number  of  critical  points.  As  we  will  see  in  the  next  chapters  the  convergence 
properties  of  this  flow  depend  on  both  the  manifold  M  and  the  function  /. 

4.4  Matrix  Lie  Groups 

As  it  was  mentioned  before  the  set  of  non-singular  n  x  n  matrices  GL(n)  is  a 
manifold.  Actually  in  addition  to  being  a  manifold  it  is  also  a  group  under  matrix 
multiplication.  GL(n)  is  an  example  of  a  Lie  group.  A  Lie  group  is  a  manifold 
G  that  has  a  group  structure  consistent  with  its  manifold  in  the  sense  that  the 
group  multiplication  is  a  smooth  map.  In  this  thesis  we  consider  only  matrix  Lie 
groups,  but  it  is  true  that  “most,  though  not  all,  Lie  groups  can  be  realized  as 
matrix  groups”  [HOWE].  There  are  different  ways  to  treat  Lie  groups,  here  we 
choose  a  short  and  informal  path  (mainly  borrowed  from  [HOWE])  and  we  give 
the  results  without  proofs.  This  approach  is  based  on  defining  the  exponential  map 
for  matrices.  For  any  n  x  n  matrix  A ,  exp  A  is  defined  as: 

°°.  Ak 

exp  A  —  eA  —  — 

k= 0 
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Note  that  this  series  converges  for  any  matrix.  Then  the  Lie  algebra  of  matrix 
Lie  group  G  is  defined  as:  0  =  {A  €  Rnxn|  exp(tA)  €  G  for  all  t  €  R}.  It  can 
be  shown  that  g  is  a  vector  space.  Moreover  g  is  closed  under  the  operation  [., .] 
defined  on  g  x  g  as: 


(Ai,  A2)  1— >  [Ai,  A2]  —  AiA2  —  A2Ai 

[., .]  is  called  the  (matrix)  Lie  bracket.  Interestingly,  g  has  a  significant  geometrical 
meaning:  it  is  nothing  but  the  tangent  space  to  G  at  the  identity  matrix  7nxn.  On 
the  other  hand  the  tangent  space  at  any  other  point  B  e  G  can  be  constructed  by 
matrix  multiplication  from  the  elements  of  g  as:  TBG  =  {HA|A  e  g}  =  {A.B|A  € 
g}.  Note  that  although  a  tangent  vector  can  be  produced  by  both  left  and  right 
multiplication  by  B  yet  two  different  elements  in  g  should  be  used  to  produce  the 
same  tangent  vector.  In  subsequent  chapters  we  will  use  the  structure  of  TBG  to 
define  suitable  Riemannian  metrics  that  match  the  group  structure. 

As  for  any  other  manifold,  we  can  define  flows  on  matrix  Lie  groups.  For  example 
one  class  of  flows  over  matrix  manifolds  is  the  exponential  flow  which  corresponds 
to  the  differential  equations  B  =  BA  or  B  =  A B  where  A  6  g  is  a  constant 
matrix.  In  the  next  chapters  we  deal  with  gradient  flows  over  matrix  Lie  groups. 

4.5  Classic  Matrix  Lie  groups 

Here  we  briefly  introduce  the  main  matrix  Lie  groups  encountered  in  this  thesis: 
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4.5.1  The  General  Linear  Group  GL(n) 

In  fact  GL(n)  is  the  “biggest”  matrix  Lie  group  and  all  other  matrix  Lie  groups 
are  its  subsets.  It  is  a  non-compact  Lie  group  of  dimension  n2  and  its  Lie  algebra 
g[(n)  is  R"xn. 

4.5.2  The  Special  Linear  Group  SL(n) 

SL(n)  is  the  group  of  all  nxn  matrices  with  unity  determinant.  It  is  a  non-compact 
Lie  group  of  dimension  n2  —  1  and  its  Lie  algebra  is  sl(n)  =  {A  e  Rnxn|tr(A)  =  0} 
where  tr(.)  is  the  matrix  trace  operator.  To  see  the  relevance  of  this  result  we  cite 
the  identity  det(e^)  =  etr(-A',t. 

4.5.3  The  Orthogonal  Group  O(n) 

0(n)  is  the  group  of  orthogonal  nxn  matrices.  It  is  a  compact  Lie  group  of 
dimension  n^n~1^  and  its  Lie  algebra  is  o(n)  =  {A  €  Rraxn|A  =  —  AT},  i.e.  the 
space  of  n  x  n  skew-symmetric  matrices.  0(n)  has  two  components:  SO(n)  which 
is  the  Lie  group  of  orthogonal  matrices  with  unity  determinant  and  the  other  part 
is  the  set  of  orthogonal  matrices  with  —1  determinant  which  is  not  a  group. 

There  are  other  matrix  Lie  groups  that  we  will  consider  in  Chapter  6,  groups 
such  as  non-singular  diagonal  or  lower  triangular  matrices.  All  the  above  Lie 
groups  can  be  equipped  with  suitable  Rienrannian  metrics  that  match  their  group 
structure.  As  we  will  see  in  next  chapters  we  can  also  derive  gradient  flow  for 
minimization  of  a  cost  functions  over  these  matrix  groups. 
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Chapter  5 


The  Orthogonal  Joint 
Diagonalization  Gradient  Flow 


In  this  chapter  we  derive  the  gradient  flow  for  the  Orthogonal  Joint  Diagonalization 
problem,  and  consider  its  convergence  properties.  We  also  give  another  version  of 
the  JADE  algorithm  based  on  discretization  of  the  gradient  flow. 


5.1  Introduction 

Let  us  consider  the  set  of  n  x  n  symmetric  matrices  {Cio}£Li-  Define  the  cost 
function: 


J\ (0)  =  ||0C'O0T  -  diag(0C'jO0T)||^  (5.1) 

i=l 

where  0  belongs  to  the  group  of  n  x  n  orthogonal  matrices,  i.e.  O(n).  By  min¬ 
imizing  this  function  over  O(n)  we  will  find  the  joint  orthogonal  diagonalizer  of 
C  =  {C)o}i=i  ■  If  {C/o}£i  have  an  orthogonal  joint  diagonalizer  this  minimum  will 
be  zero,  otherwise  it  is  not  zero.  Yet,  because  of  compactness  of  0(n)  as  it  was 
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mentioned  in  Chapter  3  we  are  sure  that  a  minimizer  exists. 

We  follow  and  generalize  the  work  of  [Brockett]  and  [Helmeke]  in  developing  gra¬ 
dient  flows  on  0(n)  for  this  cost  function. 

5.2  The  Gradient  Flow  for  Minimization  of  Ji(0) 

Consider  the  Lie  group  0(n),  at  the  point  0  €  O(n)  and  for  two  tangent  vectors 
£  and  rj  we  define  a  Riemannian  metric  as: 

(£,  v)e  =  tr((0TOT  QTV)  =  tr(£T?l)  (5.2) 

where  tr(.)  is  the  usual  matrix  trace  operator.  Based  on  this  inner  product  we 
then  find  the  gradient  flow  for  the  cost  function  J\. 

Theorem  5.1  Consider  the  cost  function  Ji(0)  then  its  gradient  with  respect  to 
the  Riemannian  metric  (5.2)  is: 

N  N 

VJi(0)  =  -2  0^  [0T{diag(0aO0T)0,Go]  =  -2 2  [diag(0ao0T),  0CiO0T]  0 

i= 1  i= 1 

(5.3) 

where  [ X ,  Y]  =  XY  —  YX  is  the  Lie  Bracket  matrix.  Moreover  the  gradient  flow 
for  minimization  of  this  cost  function  is: 

1  N 

0  =  ~2 V Jr(0)  =  [diag(0C'jO0r),  ©ao0T]  0  (5.4) 

i= 1 

Proof :  In  the  Appendix. 

A  legitimate  question  is  about  the  behavior  of  the  solutions  of  this  gradient  system 
as  t  — >  +oo,  that  whether  0  converges  at  all  or  in  the  case  that  it  converges  to 
a  single  point,  does  it  converge  to  a  local  or  global  minimum?.  Using  the  La 
Salle’s  invariance  principle  (Section  4.3)  observe  that  the  equilibria  of  (5.4)  Q  = 
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{0  €  0(n)|X)il1  [diag(0C'io©T),  0C)o0T]  =  0}  is  exactly  the  largest  invariant 
set  for  which  J  =  0.  Hence  any  solution  stays  on  O(n)  and  converges  to  Ll. 
Still  this  does  not  give  any  information  about  the  nature  of  the  convergence,  i.e. 
whether  it  is  single  point-convergence  or  set-convergence.  We  state  a  theorem  from 
[Mahony]  that  guarantees  point-convergence  for  (5.4).  This  theorem  considers 
gradient  flows  for  analytic  functions,  i.e.  functions  that  have  Taylor  expansion  in 
a  local  coordinate  around  a  point.  In  fact  gradient  flows  of  analytic  functions  have 
simple  behavior,  which  is  described  in  this  theorem: 

Theorem  5.2  Let  J  :  Mn  — ►  R  on  the  smooth  Riemannian  manifold  M  be  an 
analytic  function.  Let  the  Riemannian  metric  at  x  €  M  in  a  local  coordinate  have 
a  symmetric  matrix  representation  Qx.  Assume  that  for  any  point  x  €  M  there 
exists  an  open  neighborhood  Ux  C  M  and  0  <  Aix  <  A2x  such  that  for  all  z  €  Ux  we 
have  Xixlnxn  <  Qz  <  A2 xInxn,  where  Anxn  >  Bnxn  means  that  A  —  B  is  positive 
semi- definite.  Then  for  x(0)  €  M  in  the  gradient  flow  x  =  —VJ(x),  either  x(t) 
leaves  M  or  limt_,+00  x(t)  =  x*  where  x*  is  a  stationary  (critical)  point  of  J. 

Proof:  See  [Mahony]. 

Applying  this  theorem  to  the  gradient  flow  (5.4),  and  noting  the  fact  that  J\  is 
analytic  on  O  (n),  we  conclude  that  for  any  initial  condition  there  exists  a  ©oo  6 
O(n)  such  that  limt_»+00  0(f)  =  0^  and  0^  is  a  stationary  point  of  J\.  Moreover, 
in  general,  we  expect  that  the  flow  will  converge  to  a  local  minimum,  because 
in  a  generic  case  the  Hessian  matrix  is  invertible  and  hence  all  local  minima  are 
asymptotically  stable  and  all  other  critical  points  are  unstable. 

With  regard  to  global  minimality  of  the  answer  we  have  no  proof,  but  we  have  this 
conjecture: 

Conjecture:  In  the  generic  case  (for  example  when  all  the  matrices  are  generated 
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randomly)  all  the  local  minimizers  of  Ji(0)  where  0  G  0(n)are  global  and  they 
are  unique  up  to  a  row  permutation. 

Note  that  the  above  conjecture  for  N  =  1  is  equivalent  to  the  fact  for  a  symmetric 
matrix  with  distinct  eigenvalues  orthogonal  diagonalizers  are  unique  up  to  row 
permutation. 

Assuming  the  above  conjecture  we  can  expect  that  if  0(0)  is  not  a  stationary 
point  then  the  solution  will  converge  to  a  global  minimum.  Extensive  simulations 
support  this. 


5.3  The  Double  Bracket  Equation 


We  can  also  find  differential  equations  that  govern  the  evolution  of  the  matrices 
under  diagonalization  {0Cio0T}()li-  We  consider  the  original  matrices  as  initial 
conditions  and  define  Cj(t)  =  0(t)Cio0T(t)  as  the  matrices  under  diagonalization 
in  time.  Note  that  we  require  0(0)  =  Inxn  so  that  C*( 0)  =  C^.  The  next  theorem 
gives  a  set  of  differential  equations  known  as  Double  Bracket  equations  [Helmke] 
that  describe  the  evolution  of  Ci(t )  in  time. 


Theorem  5.3  Let  0(f)  G  ()(n)  with  0(0)  =  Inxn  satisfy  the  gradient  flow  for 
minimizing  J\  as  in  the  previous  theorem,  then  each  of  the  matrices  under  diago¬ 
nalization,  i.e.  Cj(t.)  =  0(t)CjO0T(f)  satisfies: 

r  N  n 


Ci  = 


^[diag(C'j),  Ci\,Cj 


l<j<N 


(5.5) 


L  i= 1 


Proof :  In  the  Appendix. 

The  Double  Bracket  equation  can  be  shown  to  be  a  gradient  flow  of  the  cost 
function  /({CjjAj  =  Y^i=\  1 1  C'i  —  diag(C'i )  1 1  with  respect  to  a  Riemannian  metric 
known  as  Normal  Riemannian  metric  over  the  manifold  M (C)  =  {(Ci, ...,  Cjv)|Ci  = 
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QCio@T,  0  6  0(n)}  [Helmke].  We  follow  [Helmke]  to  show  this  fact.  Let  C  E 
M(C),  then  we  can  show  that  the  tangent  space  at  C  =  (Ci,...,Cjv)  to  M(C) 
is  TcM(C )  =  {([fi,  Ci], [fi,  Cjv])|fi  €  o(n)},  where  o(n)  is  the  set  ofnxu 
skew-symmetric  matrices,  i.e  the  Lie  algebra  of  O(n).  Obviously  the  mapping 
fi  E  o(n)  i— >  ([fi,  Ci], [fi,  Cjv])  is  a  linear  map  that  is  not  1-1  over  its  image 
TcM(C).  The  kernel  of  this  map  is  K  =  {Slg  o(n)|([fi,  Ci], ...,  [fi,  Cjv])  =  0}.  The 
orthogonal  complement  of  K,  defined  as  I\ ^  =  {fi  E  o(n)|tr(fiTA)  =  0,  VA  € 
K}  is  in  1-1  correspondence  with  TqM(C)  and  any  fi  E  o(n)  can  be  written  as 
fi  =  fill  +  fi1-  where  fill  6  K  and  fi1-  €  K1.  A  useful  observation  is  that  for  any 
symmetric  matrix  Hnxn ,  the  matrix  He  =  ([H,Ci],  ...,[H,Cn])  belongs  to  iW, 
because  tr(LZjfi)  =  tr(iL[fi,  C,])  =  0  for  any  fi  €  K.  In  the  next  step  we  define 
an  inner  product  between  £i,£2  E  TcM(C )  in  terms  of  pre-images  of  £i,£2  in  /Lx, 
indeed  we  can  define  the  Normal  Riemannian  metric  as: 

(6,6)c  =  tr(fi^Tfi^ ) 

where  fi]1,  fi{  E  K ^  are  such  that  £*  =  ([fi*,  Ci], ...,  [fi*,  C^r])  for  i  =  1,2.  Note  that 
for  any  fi1-  E  K±  the  directional  derivative  of  /  along  f  =  ([fix,  Ci], ...,  [fi-1,  Cn})  is 
given  by  Df  |5  =  —  tr(fi-LT2^^=1[diag(C'j),  C*]).  By  the  properties  of  the  gradient 
vector  with  respect  to  the  Normal  Riemannian  metric  we  can  see  that  V/  = 
([A,  Ci], ...,  [X,  Civ])  for  some  X  E  K1-  such  that  Df j  =  tr(fi-LX),  therefore  X  = 
— 2X)ili[diag(Cj),Ci].  So  we  state: 

Theorem  5.4  The  flow: 

1  /  N  N 

(Cuo,CN)  =  -2 V/  =  (  [5^[diag(Cj),  Cj], Ci] , [ ^[diag(Cj),  C,J,  Cjy] 

'  i= 1  i= 1 

which  is  exactly  the  flow  in  (5.5)  is  a  gradient  flow  for  minimization  of  f  on  M(C ) 

with  respect  to  the  Normal  Riemannian  metric  defined  above. 
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5.4  Discretization  of  the  Gradient  Flow 

The  subject  of  discretization  of  differential  equations  on  manifolds  and  groups  is 
a  new  and  challenging  field,  it  is  also  known  as  structure  preserving  integration 
or  geometrical  integration  [Devore].  The  difficulty  lies  where  we  should  keep  the 
updated  answer  always  on  the  manifold.  For  example,  for  (5.4)  we  should  have 
updates  that  give  a  Ok  €.  O(n)  for  any  integer  k.  It  is  shown  in  [Calvo]  that 
only  implicit  Runge-Kutta  methods  can  give  orthogonal  updates,  and  in  general 
an  implicit  Runge-Kutta  is  costly  to  solve  and  it  is  difficult  to  find  exact  solutions 
for.  Thus  we  should  expect  to  have  methods  that  can  retain  orthogonality  only 
approximately,  this  is  in  contrast  to  the  JADE  algorithm  or  Jacobi  method  that 
retains  orthogonality  by  construction. 

In  the  sequel  we  will  use  the  Euler  and  fourth  order  (Explicit)  Runge-Kutta  dis¬ 
cretization  methods  to  discretize  (5.4).  The  Euler  method  is  very  simple  and  be¬ 
cause  of  its  immediate  relation  to  the  gradient  descent  method  is  considered.  The 
Euler  and  Runge-Kutta  methods  have  no  measures  to  keep  the  updates  on  O(n), 
although  by  small  enough  step-size  we  can  achieve  almost  orthogonal  answers.  We 
will  use  fixed-step  size  schemes  mainly  because  they  are  simple  to  implement  and 
that  we  require  small  step-size  more  than  anything  for  having  almost  orthogonal 
answer.  We  also  briefly  introduce  two  other  methods,  the  adjoint  equation  method 
[Calvo]  and  Cayley-transform  methods  [Iserles]  that  try  to  retain  orthogonality  in 
updates,  but  are  more  complicated  and  computation  demanding. 
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5.4.1  The  Euler  Discretization  Method 


The  Euler  method  or  the  first  order  Runge-Kutta  method  to  discretize  (5.4)  results 
in: 

N 

ek+1  =  (7  -  /Ik  [diag(0feC'/o©D>0fcC'/o@r])@fe  =  (7-//fcAfe)0fc  k  >  0  (5.6) 

i=  1 

where  /ik  is  the  step  size  for  the  kth  update  and  0O  is  an  orthogonal  initial  condition. 
In  (5.6),  A k  due  to  the  definition  of  the  Lie  bracket  is  a  skew-symmetric  matrix, 
so  if  we  assume  Qk-i  is  orthogonal  we  have: 

Vk  =  0fe0fc  -1  =  1-  ^fc(Afc  +  Afc )  +  fi2k AkAk  -  I  =  fi2k AfcA^ 

where  T>k  is  called  local  orthogonality  error ,  given  the  assumption  that  Qk-i  is 
orthogonal.  We  can  see  that  the  local  orthogonality  error  is  of  the  second  order 
in  fik.  It  should  be  noted  that  the  Euler  method  is  exactly  the  same  as  gradient 
descent  method,  so  at  each  step  for  small  enough  step  size  the  cost  function  can 
be  reduced  by  this  method. 

One  important  issue  is  how  to  choose  the  step  size  /j,k  at  each  step  so  that  we 
have  stable  (bounded)  and  fast  converging  answer  which  has  small  orthogonality 
error.  In  numerical  optimization  it  is  usually  desired  to  find  the  largest  step  size 
that  gives  the  largest  reduction  in  the  function  value,  on  the  other  hand  here  we 
also  have  the  orthogonality  constraint  which  requires  small  step  sizes.  Therefore 
we  encounter  two  opposite  requirements  in  choosing  the  step  size.  It  seems  that 
algorithms  to  find  the  best  step  size  (fulfilling  both  mentioned  opposite  goals)  will 
be  very  costly.  Hence  in  practice  we  try  to  find  a  small  and  constant  step  size  that 
achieves  both  goals.  Evidently  this  step  size  depends  on  many  factors,  such  as 
the  dimension  of  the  matrices,  the  number  of  matrices  and  the  relative  scaling  of 
the  matrices.  In  fact  almost  all  numerical  optimization  and  integration  methods 
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Algorithm  5.1 

1.  Set  //  and  e. 

2.  Set  0O  =  Inxn  or  to  a  “good”  initial  guess. 

3.  While  ||Afc||j?  >  e  do 

if  ||0fe+ij]F  or  ||0fe+i0^+1  —  Inxn ||f  are  “frig”  then  reduce  /i  and  goto  2. 

4.  End 

Table  5.1;  A  fixed-step  size  implementation  of  the  Euler  discretization  of  the  gradient  flow  (5.4),  which  is 
re-written  in  the  form  ©  =  A ©(t)0(t).  The  unspecified  parameters  and  qualities  are  to  be  decided  in  practice. 


require  some  step  size  tuning  that  depend  on  the  nature  of  the  data  and  should 
be  found  by  trial  and  error.  On  the  other  hand  it  is  rather  easy  to  find  out  if  the 
chosen  step  size  is  bad,  i.e.  if  ||0fo||F  is  becoming  “big”  or  far  from  orthogonal  then 
this  means  that  the  guessed  /i  is  not  appropriate.  Therefore  we  ought  to  reduce 
fi  and  re-start  the  algorithm.  Of  course,  it  is  possible  to  devise  more  elaborate 
adaptive  methods. 

Table  (5.1)  depicts  this  algorithm  in  pseudo  code  (see  equation  (5.6)). 

Remark:  In  the  context  of  BSS  a  “good”  initial  guess  for  0O  is  in  fact  the 
transpose  of  an  eigen  matrix  corresponding  to  any  single  cumulant  slice. 

In  the  above  code  there  are  still  some  un-specihed  things  such  as  how  to  choose 
the  initial  /i  and  e  and  what  do  we  mean  by  “big”  and  how  to  reduce  fr  in  case 
of  observing  instability  or  large  orthogonality  error.  These  are  issues  that  depend 
on  the  data  at  hand  and  the  desired  accuracy  in  the  solution.  In  Section  5.6  we 
consider  the  effect  of  these  parameters  in  practice. 
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5.4.2  Runge-Kutta  (RK)  Methods 

Runge-Kutta  methods  are  well  known  numerical  integration  methods  [Stuart]. 
Here  we  first  introduce  the  general  form  of  Runge-Kutta  methods  for  discretization 
of  X  =  f(X),  where  X(t)  €  Rnxra  and  /  :  Rnxn  — »  Rnxn  is  a  smooth  enough 
vector  held.  Then  the  classical  s  stage  (RK)  method  is  written  as: 

S 

Y  =  Xk  +  i  =  l,...,s  (5.7) 

3= i 

S 

Xk+1  =  Xk  +  fi  bif(Yi),  X0  =  X(0)  (5.8) 

i= 1 

where  /i  is  called  step-size,  1^’s  are  called  the  stage- values.  Numbers  dy  form  a 
matrix  4  and  6,:’s  form  a  vector  b.  b  is  such  that  X()=i  h  =  1.  If  (Mj  =  0  for  i  <  j 
then  the  RK  scheme  is  called  explicit  otherwise  it  is  called  implicit.  The  order  of  a 
RK  method  is  related  to  the  accuracy  of  the  first  step  error  E\  =  X{fj)  —  Xi,  that  is 
the  method  is  of  rth  order  if  E\  =  F(X0)p,r+1  +  0(^r+2),  for  some  matrix  function 
F  where  ||0(^r+2)||F  <  Kpr+2  for  small  h  and  a  constant  matrix  K.  Hence  an  s 
stage  method  is  not  necessarily  of  order  s.  Note  that  the  Euler  method  described 
above  is  a  one  stage  and  first  order  method. 

It  should  be  noted  that  the  above  formulation  is  a  vector  space  based  formulation, 
so  obviously  if  it  is  applied  to  a  differential  equation  on  a  manifold  again,  the 
issue  of  staying  on  the  manifold  arises.  In  [Calvo]  it  is  shown  that  no  explicit  RK 
method  can  retain  orthogonality.  Moreover  in  [Calvo]  it  is  shown  that  if  an  rth 
order  method  is  applied  for  a  differential  equation  of  the  form  0  =  F(0)0  (*) 
where  0(0)  eO(n),  then  the  local  orthogonality  error  defined  as  Vk  =  Qk@k  —  I  is 
such  that  ||T>fc||i?  <  Lkp.r+1  for  small  enough  /j  and  a  constant  Lk.  Hence  we  should 
expect  that  a  higher  order  method  will  result  in  a  better  orthogonality  error  for 
the  same  value  of  step-size.  In  [Calvo]  a  method  of  adjoint  equations  is  introduced 
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that  achieves  the  above  bound  for  an  explicit  RK  method  of  order  r  —  1. 

Another  possible  idea  is  to  use  the  Cayley  transform-  The  Cayley  transform 
of  ©  60(n)  is  defined  as:  Q  =  Cay(0)  =  \{I  —  ©)(/  +  0)_1,  it  is  a  skew- 
symmetric  matrix  and  belongs  to  o(n)  and  the  inverse  Cayley  transform  given  by 
0  =  (/  —  |n)_1(/—  ^17)  belongs  to  0(n).  The  Cayley-transform  methods  [Iserles] 
for  discretization  of  (*)  are  based  on  the  idea  of  solving  (advancing)  the  counterpart 
of  (*)  on  the  Lie  algebra  at  any  point  and  again  transforming  the  answer  back  to 
the  group  0(n).  Note  that  because  the  Lie  algebra  is  a  linear  space  we  have  no 
difficulty  in  keeping  the  answer  in  it.  For  more  detail  the  reader  is  referred  to 
[Calvo] . 

One  point  to  mention  is  that  the  underlying  problem  in  our  case  is  an  optimization 
problem  for  which  we  derived  a  gradient  descent  ODE.  In  this  context  one  may 
want  to  have  the  property  that  at  each  step  of  discretization  the  cost  function  is 
reduced,  the  same  as  standard  gradient  descent  methods.  As  we  mentioned  the 
Euler  method  obviously  has  this  property,  that  is  for  some  /jq ,  any  //  <  /j.q  results 
in  an  update  that  reduces  the  cost  function  and  moreover  the  stationary  points  of 
the  discretized  equation  and  the  discrete  algorithm  are  the  same.  On  the  other 
hand  whether  the  RK  methods  posses  this  property  is  not  clear  in  general.  In 
[Stuart  p.519]  it  is  shown  that  under  the  additional  assumptions  that  the  gradient 
of  the  cost  function  is  globally  Lipschitz  and  radially  unbounded  this  is  true  for 
RK  methods  in  R".  Hence  we  may  expect  this  to  hold  to  a  good  extent  on  other 
manifolds  locally.  Nevertheless,  this  form  of  discretization  is  not  popular  in  the 
optimization  community. 
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Table  5.2:  A  fixed-step  size  implementation  of  the  fourth  order  RK  discretization  of  the  gradient  flow  (5.4), 
which  is  re-written  in  the  form  ©  =  A©@  =  /(©).  The  unspecified  parameters  and  qualities  are  to  be  decided  in 
practice. 


Fourth  Order  RK  Method  Discretization  for  0 

One  of  the  most  popular  RK  methods  is  the  classical  four-stage  fourth  order  explicit 
RK  method  for  which  A  is  such  that  a^\  —  =  \  ,  U43  =  1  and  all  other 

elements  of  A  are  zero  and  b  is  [|,  |]T.  We  use  this  scheme  to  discretize  (5.4). 

We  will  use  a  fixed  small  step-size  to  implement  this  method.  Again  we  re-write 
(5.4)  in  the  form  0  =  Aq0  =  /(0).  The  derived  algorithm  is  shown  in  Table  (2). 
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5.5  Applications  in  the  BSS/ICA  Problem(EG- 
JADE  and  RKG-JADE  Algorithms) 

We  can  use  the  methods  shown  in  Tables  (5.1)  and  (5.2)  to  construct  gradient  based 
versions  of  the  JADE  algorithm,  simply  by  replacing  the  Jacobi  based  joint  diag- 
onalization  by  the  algorithms  developed  in  previous  section.  We  call  the  method 
using  the  Euler  method  Euler- Gradient  based  JADE  or  the  EG-JADE  algorithm 
and  the  other  one  Runge-Kutta  Gradient  based  JADE  or  RKG-JADE.  These  two 
methods  retain  orthogonality  in  0  only  approximately.  However,  in  the  case  of 
noisy  ICA  or  when  using  estimated  cumulants  in  the  JADE  algorithm,  we  know 
that  after  whitening  the  data  the  separating  matrix  is  not  an  orthogonal  one  nec¬ 
essarily  or  exactly,  therefore  the  constraint  of  keeping  the  update  orthogonal  can 
be  compromised  (see  Section  7.1  for  more  detail).  In  other  words  for  the  ICA/BSS 
problem  we  can  tolerate  slightly  off-O(n)  answers  of  (5.4).  Of  course  we  will  lose 
the  compactness  property  of  O(n)  which  is  a  theoretical  guarantee  for  the  conver¬ 
gence  of  flow  (5.4). 


5.6  Numerical  Examples 

In  this  section  we  examine  the  performance  of  Algorithms  5.1  and  5.2  in  the  con¬ 
text  of  joint  diagonalization  and  ICA/BSS  problems. 

Example  1:  In  this  example  we  generate  N  =  100  symmetric  20  x  20  (n  =  20) 
matrices  of  the  form  C)  =  UAiUT,l  <  i  <  N  where  A*  are  diagonal  matrices 
with  elements  that  have  i.i.cl  standard  Gaussian  distribution.  We  find  {Ci}’s 
joint  diagonalizer  using  Algorithm  5.1.  In  the  first  set  of  experiments  we  set 
/i  =  .0001,  .0002,  .0006,  .0012  and  run  the  algorithm  to  observe  the  behavior  of 
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Example  1 :  Variation  of  perfromance  with  respect  to  the  step  size  (N=100,n=20) 


Figure  5.1:  This  figure  shows  variation  of  some  performance  measures  computed  in  Example  (1)  for  different 
values  of  step  size  fr.  a.  log ( 1 1 V Ji(Ojc)||)  b.  log(Ji(Oj;))  c.  log (Index(Pk))  d. H'Dfc ||  e.  det(Ofc) 


different  quantities  (we  do  not  use  any  stop  criteria,  but  run  the  iteration  for 
k=1000  times).  Figure  (5.1)  shows  the  variation  of  some  quantities  in  terms  of  the 
number  of  iterations  k.  The  first  graph  depicts  log(VJ(0fc)).  Because  of  the  large 
dynamic  range  of  the  graph  we  used  logarithm.  Note  that  the  gradient  decreases 
much  faster  and  deeply  for  large  /i's.  The  second  graph  shows  log(Ji(@fc)).  The 
third  graph  shows  how  far  P *.  =  0*,f7  is  from  being  essentially  diagonal,  using  the 
Index  criterion  defined  by  equation  (2.6).  In  fact  it  shows  log (Index(Pk)).  The 
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ideal  value  for  Ji(0oo)  and  Index(Poo)  is  zero.  So  obviously  the  smaller  values  of  // 
result  in  better  performance  although  the  corresponding  gradients  are  not  smaller 
in  norm.  The  reason  will  become  clear  by  looking  at  the  next  two  graphs  where 
they  show  measures  of  orthogonality.  The  fourth  graph  shows  \\T>k\\  and  the  fifth 
one  shows  det(0fc).  These  two  graphs  show  that  the  smaller  values  of  /j  result  in 
better  orthogonality  error  and  so  that  is  why  they  give  better  performance.  Among 
these  values  for  /jl  it  seems  that  n  =  .0006  is  a  good  one  that  gives  a  fair  balance 
between  computational  complexity  and  performance. 

In  the  next  experiment  we  increase  the  dimension  and  number  of  matrices.  Fig¬ 
ure  (5.2)  shows  the  results.  In  this  figure  we  have  four  set  of  curves.  We  consider 
N  =  100,  n  =  40  with  //  =  .0003,  .0001  and  N  =  200,  n  =  40  with  /i  =  .0003,  .0001. 
It  is  interesting  to  note  that  the  convergence  by  increasing  n,  N  or  fi  will  be 
faster  however  the  performance  will  deteriorate,  because  the  orthogonality  error 
increases. 

We  should  mention  that  the  matrices  generated  in  the  above  experiments  were 
somehow  “well”  scaled,  in  the  sense  that  A$’s  generated  have  almost  the  same 
dynamic  range  of  values  and  this  made  the  convergence  of  the  algorithm  much 
better.  Although  this  may  not  be  the  case  in  a  general  JD  problem,  we  can  argue 
that  in  the  BSS/ICA  problem  after  whitening  the  data  because  the  correlation  ma¬ 
trix  is  identity,  the  dynamic  range  of  the  data  is  restricted  so  the  cumulant  slices 
will  have  almost  close  norms.  As  an  example  for  the  case  that  bad  scaling  makes 
things  difficult  we  consider  the  first  experiment  N  =  100,  n  =  20  but  we  consider 
Ci  =  y/i  UAiU  then  algorithm  (5.1)  does  not  converge  for  fi  =  .0001  at  all.  Even 
by  decreasing  [i  to  .000001  still  the  algorithm  does  not  converge.  Hence  in  the  case 
that  the  dynamic  range  of  the  matrices  are  different  this  algorithm  will  not  work 
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Example  1 :  The  effect  of  size  of  matrices  (n=40) 


N=200,g=.0001 

N=200,n=.0003 


N=1 00,|x=.0003 
-  -  N=100,g=.0001 
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Figure  5.2;  This  figure  shows  how  the  performance  changes  when  the  dimension  and  number  of  matrices 
is  increased  in  Example  (1).  N  =  100,  n  =  40  and  N  =  200,  n  =  40  are  used  with  //  =  .0003,-0001.  a. 
l°g(l|VJi(©fc)||)  b.  log(Ji(efc))  c.  log(Index(Pk))  d.||X>fc||  e.  det(0fc) 
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properly,  although  in  the  case  of  BSS  problem  after  whitening  this  is  unlikely  to 
be  the  case. 

Example  2:  In  this  example  we  apply  the  EG-JADE  algorithm  to  a  BSS  problem. 
In  the  first  experiment  we  consider  n  =  4  independent  sources  s  mixed  through  a 
randomly  generated  mixing  matrix: 


A  = 


-1.72577265692312 

0.81319959967304 

1.44186661829232 

0.67227220216042 


0.13866495543565 

-0.85953392675766 

-0.75225055816553 

1.22961508382908 


1.15075435309812 

-0.60802501127072 

0.80615791615996 

0.21713285248002 


-0.37346107032557 

-0.83203043468849 

0.28686630098359 

-1.81889162356406 


One  of  the  sources  has  exponential  distribution  with  zero  mean  and  parameter 
A  =  1,  the  next  one  has  two-sided  exponential  distribution  (also  known  as  Laplace 
distribution)  with  parameter  A  =  1,  the  other  two  sources  have  uniform  distri¬ 
bution  on  [— 1,|].  We  generate  T  =  2000  sample  of  the  source  data  denoted 
by  the  matrix  St  (it  is  a  4  x  T  matrix),  then  mix  the  data  to  get  the  mixed 

signal  samples  Xt  =  ASt ■  We  then  zero  the  mean  of  each  row  of  Xt  and 

_  1 

then  compute  Rxx  =  ^XrX^f.  In  the  next  step  we  find  i?xx  ,  using  the  eigen 

_  1 

decomposition  of  i?xx  and  whiten  the  data  to  get  YT  =  RxxXrl .  Then  using 
the  sample  estimates  of  cumulants  we  compute  the  cumulant  tensor,  and  there¬ 
after  Cy  =  {Cumy(:,:,i,  j) |1  <  i,j  <  n  =  4}.  Note  that  due  to  the  symmetry 
of  cumulant  tensor  we  do  not  need  to  compute  all  the  cumulants,  and  in  fact 
Cum(:,:,i,j)  =  Cum(:,:,j,i).  Next  we  pass  the  set  Cy  through  the  EG-JADE 
algorithm  (Algorithm  5.1)  with  parameters  /j  =  .01  and  e  =  .1.  We  do  not  use  any 

initial  guess  such  as  an  eigen  matrix  of  one  of  cumulant  slices.  This  could  enable 

_  1 

1  Direct  computation  of  i?xx  and  afterwards  i?xx  is  not  the  best  practical  way  to  whiten  the 
data  both  from  computational  load  and  accuracy  points  of  view,  however,  for  our  purposes  and 
due  to  the  good  accuracy  of  computations  in  MAT  LAB®  this  method  is  satisfactory.  For  more 
details  the  reader  can  consult  [Comon  1]  and  the  reference  therein. 
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faster  convergence.  The  algorithm  stops  after  k  =  95  iterations  yielding: 

0.95985004288776  -0.28044871233207  0.00550000868509  0.03592182236959 

0.27822003344298  0.94999801900180  0.14445051912210  -0.03875257025448 

“95  = 

-0.02276020813643  -0.13853724807153  0.78968278035147  -0.63503749191227 

-0.04603839531831  -0.04412894967835  0.63399928692651  0.79979176271809 

with  ||©95©g5  —  IaxaWf  =  0.06586700007617.  The  estimated  un-mixing  matrix 
-1 

B  =  @95  i?xx  is  such  that  the  the  total  mixing  matrix  is  : 

0.03884229547725  0.01947545028173  3.42035143547487  0.09397388342959 

„  „  .  0.02977653938651  -0.01591949619921  -0.16913582939640  3.50898334470220 

P  =  BA  = 

0.05539773531559  -0.72804272424862  0.04057529899273  0.05829470954786 

-0.99678892890767  -0.00922039815672  0.08732420762586  0.10336859023621 

Note  that  it  is  close  to  a  permuted  diagonal  matrix  and  in  fact  its  distance  from 
being  essentially  diagonal  is  Index(P)  =  0.86359255677199.  We  can  give  two  rea¬ 
sons  that  Index(P)  is  not  exactly  zero:  first,  we  are  using  estimates  of  cumulants 
rather  than  exact  values;  second,  we  have  not  proceeded  further  to  minimize  the 
cost  function  (the  effect  of  VJi(095)  not  being  exactly  zero)  and  that  we  have  or¬ 
thogonality  error.  However,  the  first  cause  is  more  harmful.  In  fact  with  e  =  .0001 
and  fi  =  .0001  we  can  have  Index(P)  =  0.74485563657809  after  k  =  35056  itera¬ 
tions!.  Here  we  can  notice  a  drawback  rampant  among  gradient  based  optimization 
methods  (as  opposed  to  Newton  based  methods),  which  is  their  slow  convergence 
when  they  are  close  to  the  answer. 

To  better  understand  the  effect  of  mixing  and  un-mixing  on  the  data  we  plot  com¬ 
ponents  of  the  source,  sensed,  whitened  and  un-mixed  data  with  respect  to  each 
other  in  order  to  visualize  their  dependency.  Figure  (5.3)  shows  these  graphs.  From 
the  boundaries,  especially  in  the  case  of  uniform  variables,  we  may  understand  the 
dependency.  The  first  row  shows  the  independent  source  data.  The  second  row 
shows  the  mixed  data.  The  third  row  shows  the  whitened  data,  we  can  see  that 
still  some  dependency  has  remained.  The  last  row  shows  almost  independent  data. 
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Example  2:  The  components  of  source,  mixed,  whitened  and  restored  signal  with  respect  to  each  other 
SI  and  S2  QQ  .  Q/1 


Figure  5.3)  This  figure  tries  to  help  to  visualize  the  dependence  between  the  components  of  data  during 
each  step  of  the  mixing-unmixing  process  in  the  first  experiment  Example  (2).  Each  graph  is  one  component 
of  data  vector  with  respect  to  another  component,  a)  si  and  S2,  they  are  one-sided  and  two-sided  exponentials, 
respectively.  They  are  independent,  b)  S3  and  S4  and  both  are  uniform.  They  are  independent,  c)  xi  and  X3.  They 
are  dependent,  d)  X2  and  X4.  They  are  dependent,  e)  yi  and  y2-  They  are  uncorrelated  but  still  dependent,  f)  y3 
and  y4.  They  are  uncorrelated  but  still  dependent,  g)  si  and  §2.  They  are  almost  independent (s  =  B As  =  Ps). 
h)  S3  and  S4.  They  are  almost  independent. 
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Note  that  the  element  with  maximum  absolute  value  in  the  last  row  of  P  (this  cor¬ 
responds  to  the  recovered  one-sicle  exponential  source)  has  negative  sign,  that  is 
why  the  graphs  (a)  and  (h)  are  reverse  of  each  other. 

In  the  next  step  we  consider  the  effect  of  changing  fi  on  the  convergence  rate  and 
separation  performance  of  the  algorithm.  The  table  below  shows  the  results. 

We  can  see  that  there  is  not  much  improvement  in  the  separation  performance  by 


k 

Index(P) 

| \Dk  F 

II 

o 

DNC 

- 

- 

II 

o 

CO 

31 

0.90465915694489 

0.02068842693575 

/j  =  .01 

95 

0.86359255677199 

0.00229871410397 

n  =  .005 

191 

0.85442048722669 

5.746785259929237e-004 

H  =  .001 

959 

0.84746473734464 

2.298714103972716e-005 

Table  5.3;  This  table  gives  the  performance  measures  from  applying  the  source  separation  algorithm  developed 
in  this  chapter  for  the  sources  and  mixing  matrix  A  in  Example  (2)  with  respect  to  the  step  size  /i.  Note  that  for 
/x  =  .04  the  algorithm  does  not  converge. 

decreasing  //,  however  orthogonality  error  is  reduced.  As  we  can  see  improvement 
in  orthogonality  error  does  not  improve  the  separation  performance  considerably. 
We  can  conclude  that  the  Euler  discrete  JD  scheme  with  moderately  small  step-size 
results  in  acceptable  separation  performance  with  low  computational  cost  (com¬ 
pared  to  smaller  step-sizes).  Note  that  the  algorithm  does  not  converge  for  /i  =  .04 
and  in  fact  0^  blows  up  very  fast.  This  can  be  related  to  the  fact  that  /i  is  not 
small  enough  to  guarantee  reduction  in  the  cost  function  at  each  step. 

Example  3:  In  this  example  we  will  compare  the  performance  of  the  EG-JADE 
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and  RKG-JADE  algorithms  with  the  JADE  algorithm2.  We  have  no  indication 
that  these  methods  can  perform  better  than  JADE,  because  they  are  different  ver¬ 
sions  of  each  other.  In  fact  because  it  retains  the  orthogonality  by  construction 
and  because  of  its  almost  quadratic  convergence  rate  [Bunse-Gerstner] ,  the  JADE 
algorithm  outperforms  this  gradient  based  method  slightly  in  separation  perfor¬ 
mance  and  largely  in  computational  cost. 

Consider  the  data  model: 

x  =  As  +  an 

where  n  is  the  standard  Gaussian  noise  and  a2  indicates  the  noise  power.  We 
compare  the  performance  of  EG-JADE  and  RK-JADE  with  the  standard  JADE’s 
performance  for  the  same  mixing  matrix  A  and  source  signals  as  in  Example  1. 
We  compute  the  averaged  performance  measure  Index(P)  (see  (2.6))  versus  a  in 
the  model  for  different  values  of  T  =  2000,5000,  e  =  .01  and  /t  =  .01  for  these 
algorithms.  To  calculate  the  average  of  Index(P)  we  repeat  each  experiment  100 
times  for  any  set  of  the  above  parameters. Figure  (5.4)  shows  the  results.  From 
Figure  (5.4)  we  can  see  that  the  three  algorithms  perform  almost  equally  up  to 
moderate  values  of  a.  For  larger  noise  power  the  JADE  algorithm  slightly  out¬ 
performs  the  other  two.  This  result  is  somehow  un-expected.  This  can  mean  that 
the  cost  function  Ji(@)  is  difficult  to  be  minimized  by  gradient  descent  when  the 
matrices  are  not  jointly  diagonalizable.  It  is  interesting  to  note  that  the  Euler  and 
RK  methods  also  have  the  same  performance  in  this  problem. 


2  A  MAT  LAB®  code  for  JADE  was  downloaded  from: 


http:  /  /  tsi.enst.fr /cardoso/icacentral  /  Algos 


The  average  perfomance  Index  in  terms  of  noise  a  for  differentICA  methods  in  Example  3 


Figure  5.4l  This  figure  shows  the  average  performance  index  of  EG-JADE  and  RKG-JADE  with  e  =  fi  =  .01 
and  also  average  performance  index  of  JADE  for  T  =  2000,  5000.  The  mixing  matrix  and  sources  are  the  same  as 
in  the  first  part  of  Example  (2).  Note  that  the  graphs  are  bundled  in  two  groups  corresponding  to  three  different 
values  of  T. 
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Chapter  6 


Gradient  Based  Non-Orthogonal 
Joint  Diagonalization 


In  this  chapter  we  derive  gradient  based  algorithms  for  Joint  Diagonalization  by 
non-orthogonal  matrices  using  the  cost  function  J\  introduced  in  Chapter  3.  We 
develop  methods  to  discretize  them,  such  that  the  answer  has  some  desired  prop¬ 
erties  that  resemble  the  continuous  answer.  We  also  give  a  method  based  on  the 
Armijo  Step  Size  selection  to  implement  the  gradient  flow. 

6.1  Gradient  Flow  for  Joint  Diagonalization  on 
GL(ra) 

Let  {C'i}^=1  be  a  set  of  n  x  n  symmetric  matrices  and  consider  the  cost  function 
introduced  in  (3.5)  as: 


N 

Ji (B)  =  \\BQBt  -  diag(BCiBT)\\2F  (6.1) 

i= 1 
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where  the  un- mixing  matrix  B  belongs  to  GL(n),  the  group  of  non-singular  n  x  n 
matrices.  As  it  was  mentioned  in  Section  3.3,  Ji{B)  is  not  scale- invariant,  i.e. 
Ji(AB)  ^  Ji(B)  for  non-singular  diagonal  A  and  hence  it  is  not  suitable  for  JD. 
We  remind  that  scale-invariance  for  a  cost  function  in  terms  of  un-mixing  matrix 
refers  to  left  multiplication  of  the  argument  by  diagonal  matrices.  Note  that  as 
||A|| p  ►  0,  Ji(AB)  decreases  to  zero.  In  the  next  two  sections  we  will  develop 
methods  to  avoid  such  reductions. 

Consider  the  manifold  GL(n),  at  any  point  B  €  GL(n)  for  tangent  vectors 
£,  7]  €  T^L(n>  we  define  a  Riemannian  inner  product  as: 

(£,  v)b  =  tr {{fB~1)T  =  ti ^(B~T(rr]B~1)  =  tr(r]{BTB)~1fT)  (6.2) 

In  fact  Qb  =  ( BT B)~ 1  is  the  positive  definite  matrix  representing  the  Riemannian 
inner  product  at  the  point  B.  Note  that  £ B -1  belongs  to  the  Lie  Algebra  of 
GL(n),  i.e  g[(n).  We  call  this  Riemannian  metric  the  Natural  Riemannian  metric1, 
because  it  matches  the  group  structure  of  GL(n).  Note  that  we  could  take  (£,  rj)  b  = 
tr((.B_1£)T  B~1i])  as  the  Riemannian  metric,  however,  because  we  are  concerned 
about  left  multiplication  of  B  by  diagonal  matrices  we  are  interested  in  tangent 
vectors  of  the  form  A B  where  A  G  g[(n). 

The  gradient  flow  for  minimization  of  (6.1)  on  GL(n)  corresponding  to  the  Natural 
Riemannian  metric  is  given  in  the  following  theorem: 

Theorem  6.1  :The  Natural  Gradient  of  (6.1)  on  GL(n)  is 

VJi {B)  =  ( BCiBT  -  di&g(BCiBT))BCiBA  B  (6.3) 

^  i=  1  ' 

use  this  name  after  Amari’s  Natural  Gradient  method  which  uses  the  same  metric  (see 
Chapter  2  in  [Haykin]). 
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(6.4) 


and  the  Natural  gradient  flow  for  minimizing  (6.1)  on  GL(n)  is: 

B  =  -(j2  iBCiBT  -  diag  (BCiBT))BCiBT'\  B 

'  i=  1  ' 

with  B( 0)  €  GL(n). 

Proof :  See  the  Appendix. 

The  equilibria  of  the  above  flow  is  found  by  letting  VJi(B)  =  0  or  H  = 
Etr  ( BCiBT  -  diag(BCiBT))BCiBT  =  0.  Therefore: 

tr (H)  =  tr(  ^  (BCiBT— diag(i?C'jf?T))  [BCiBT— di&g(BCiBT)+diag(BCiBT))'\ 
'  1=1  ' 

N 

J^ti ■((BCiBT  -  diag (BCiBT))2)  =  0 

i=l 

where  we  used  the  fact  that  tr((A  — diag(A))diag(A))  =  0  for  any  A.  The  above  re¬ 
sult  means  that  BCiBT  =  diag (BCi,BT)  for  all  1  <  *  <  N.  Hence  if  the  set  {6)}^ 
are  not  diagonalizable,  i.e.  there  is  no  non-singular  B  such  that  all 
are  diagonal  then,  this  flow  does  not  have  any  equilibria  or  Ji(B)  does  not  have 
any  minima  on  GL(?r)!.  This  can  be  related  to  the  fact  that  J\{B)  can  be  reduced 
to  zero  by  a  diagonal  matrix  whose  norm  is  approaching  zero  and  again  shows  that 
Ji{B)  by  its  own  is  not  a  good  cost  function  for  joint  diagonalization,  as  mentioned 
before. 


6.2  Gradient  Flow  for  Joint  Diagonalization  over 

SL(n) 

One  way  to  improve  the  problem  with  non-compactness  of  GL(n)  and  the  scale 
variability  of  J\{B)  is  to  consider  minimization  of  J\{B)  over  the  set  SL(n),  i.e. 
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the  group  of  n  x  n  matrices  with  unit  determinant.  Obviously  SL(n)  is  not  a 
compact  group  and  det(L>)  =  1  does  not  put  any  upper  bound  on  the  norm  of  B , 
but  from  the  SVD  decomposition  B  =  UYiVT  we  have  |det(.B)|  =  det(E)  so  the 
largest  singular  value  is  bigger  than  unity  hence  ||-E>||2  >  1-  By  restricting  B  to  be 
in  SL(?r),  we  identify  all  matrices  of  the  form  aB  for  a  E  R  —  {0}  with  B.  Thus 
we  hope  we  can  avoid  converging  to  the  trivial  infinrum  of  J\  at  B  =  0. 

To  find  the  gradient  flow  of  Ji  over  SL(n)  we  first  state  this  useful  lemma, 
about  finding  the  gradient  of  a  function  on  a  sub-manifold  from  its  gradient  on  the 
manifold  [Helrnke]: 

Lemma  6.1  ;  Let  f  :M— ►  R  be  a  smooth  function  on  a  Riemannian  manifold  M 
and  let  V  C  M  be  a  sub-manifold,  endowed  with  the  Riemannian  metric  induced 
from  M .  If  x  €  V ,  then  the  gradient  of  the  restriction  f\y  :  V  — >  R  is  the 
orthogonal  projection  ofVf(x)  €  TXM  on  the  tangent  space  ofV  at  x,  i.e.  TXV . 

We  state  a  simple  lemma  that  gives  the  orthogonal  projection  of  a  any  matrix  on 
the  linear  space  of  matrices  of  zero  trace,  namely  sl(n). 

Lemma  6.2  ;  The  orthogonal  projection  of  the  matrix  Anxn  on  the  space  of  ma¬ 
trices  with  zero  trace  is  given  by:  A0  =  A  —  t-^p-Inxn- 

Proof:  Obviously  tr(A°)  =  0  and  t,r(A0T(A  —  A0T ))  =  tr(A)tr(A0T)/n  =  0. 
Therefore  A 0  is  the  orthogonal  projection  of  A  to  the  space  of  zero  trace  matrices.  A 
Now  we  are  ready  to  give  the  gradient  flow  of  Ji(B)  on  SL(n).  The  next  two 
theorems  give  the  Natural  gradient  of  J\  on  SL(n). 

Theorem  6.2  ;  The  Natural  gradient  of  J\(B)  on  SL(n)  is: 

VJi(-B)  =  4A  °B  (6.5) 
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where  A0  is  the  orthogonal  projection  of 


A  =  (  BCiBT  -  diag(£> Cj-BT)  )  BCiBT 

i=  1  '  ' 

on  si (n)  from  the  above  lemma.  Furthermore  the  natural  gradient  flow  to  minimize 
Ji(B)  on  SL(n)  is: 

B  =  — VJi(-B)  =  —A°B  (6.6) 

with  B( 0)  €  SL(n). 

Proof:  Just  apply  Lemma’s  6.1  and  6.2. 

For  the  convergence  properties  of  this  flow,  we  resort  to  the  Theorem  5.2.  As 
a  result  we  can  say  that  if  the  solution  to  (6.6)  stays  on  SL(n)  then  it  converges 
to  a  single  critical  point  (more  likely  to  a  local  minimum)  of  Ji  on  SL(n). 


6.3  Nonholonomic  Flow  for  Joint  Diagonaliza- 
tion 

A  more  general  way  to  deal  with  non-compactness  of  GL(n)  and  scale- variability  of 
Ji(B)  is  to  project  the  gradient  onto  an  appropriate  space  in  order  to  restrict  the 
reduction  in  the  cost  function  to  only  desired  directions.  For  instance  we  may  be 
able  to  prevent  undesirable  reduction  in  the  cost  function  due  to  diagonal  matrices. 
In  other  words  we  want  to  constrain  the  flow  such  that  it  does  not  reduce  the  cost 
function  due  to  row  scaling.  For  this  purpose  we  introduce  the  concept  of  group 
actions  and  orbits. 

Remark:  Nonholonomic  constraints  are  constraints  on  a  vector  field  or  velocity 
that  are  not  integrable.  Nonholonomic  treatment  of  the  ICA  problem  is  related 
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to  the  issue  of  scale  indeterminacy  in  the  un-mixing  matrix  and  is  a  well  known 
property  (see  for  example  Chapters  2  and  3  in  [Haykin]  as  well  as  [Amari  2]). 

6.3.1  Group  Action  on  a  Manifold 

Definition  6.1  :  A  (left)  action  of  a  Lie  group  G  with  identity  element  e  on  a 
manifold  M  is  a  smooth  mapping  <f>  :  G  x  M  — >  M  such  that  : 

1.  <t(e,  x)  =  x  for  all  x  E  M, 

2.  <f>(g,  x))  =  $(gh,  x)  for  all  g,h  E  G  and  x  €  M . 

For  any  x  E  M  the  orbit  of  x  is  defined  as  Ox  =  {<E>(g,  x)\g  E  G}  C  M 

Example:  Let  G  be  GL(n)  and  M  the  space  of  n  x  n  matrices.  Then  the  action 
defined  by:  Sim  :  GL( n)  x  M  — ■>  M  with  (S,  X)  SXS _1  is  the  similarity 
transformation  which  preserves  the  set  of  eigenvalues  of  X.  The  orbit  of  this 
action  is  Ox  =  {SAT S'-1  (S'  E  GL(n)},  that  is  all  the  matrices  that  have  the  same 
eigenvalues  as  X. 

It  is  easy  to  see  that  for  any  action  the  relationship  “x  ~  y,  x,  y  E  M  iff  y  E  CV’is 
an  equivalent  relationship  on  M  and  the  orbits  are  its  equivalent  classes.  In  the 
above  example  the  relationship  of  “having  the  same  set  of  eigenvalues”  is  the 
corresponding  equivalence  relationship. 

6.3.2  The  Action  of  the  Group  of  Diagonal  Matrices 

Consider  the  group  of  nonsingular  diagonal  matrices  T>.  Let’s  define  an  action 
<L  :  V  x  GL(n)  — ►  GL(n)  by  ( D,B )  i— ►  DB,  i.e.  left  multiplication  by  a  non¬ 
singular  diagonal  matrix.  The  orbit  of  a  matrix  B  is  the  set  Ob  =  {DB\D  E  V}. 
We  can  show  that  Ob  is  an  n  dimensional  sub-manifold  of  GL(n).  The  tangent 
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space  to  this  sub-manifold  at  B  is  Tbb  =  {A_B|Anxnis  diagonal}.  On  the  other 
hand  the  tangent  space  to  GL(n)  at  B  is  TBLI'n>  =  {A.B|A  €  Rnxn}.  Thus 
Tbb  C  TpjL(nK  The  orthogonal  complement  of  the  of  Tbb  in  TBL^  is  given  the 
next  proposition. 

Lemma  6.3  ;  At  any  point  B  €  GL(n)  the  orthogonal  complement  with  respect  to 
the  Natural  Riemannian  metric  ofTBB  inTBL ^  is  the  setTBB  =  {AL>|diag(Anxn) 
01- 

Proof:  TgB  is  an  n  dimensional  linear  subspace  of  TBL ^  with  bases  {A 
where  A*  is  a  diagonal  matrix  whose  only  non  zero  diagonal  element  is  the  iith  entry 
and  is  equal  to  unity.  On  the  other  hand  TB  B  is  also  a  linear  n2  —  n  dimensional 
subspace  of  TBL^  with  bases  {A for  i  ^  j  where  A^  is  an  n  x  n  matrix 
whose  only  non  zero  element  is  at  the  ijth  position  and  is  unity.  For  any  k  and 
i  ^  j  between  1  and  n  let  ^  =  A ^B  and  r]  =  A jjB,  then  we  have: 

tr  =  tr(A^A^)  =  0 

Hence  every  vector  in  TB  B  is  perpendicular  to  all  vectors  in  TB  B ,  and  because  the 
sum  of  their  dimensions  is  n 2 ,  they  are  orthogonal  complement  with  respect  to  the 
Natural  Riemannian  inner  product.  A 

Using  this  lemma  we  can  project  any  tangent  vector  to  GL(n)  at  B  to  two  orthog¬ 
onal  components,  one  along  Tbb  and  the  other  orthogonal  to  that.  The  direction 

along  TBB  amounts  to  the  direction  in  which  the  cost  function  is  reduced  by  only 

q r 

diagonal  matrices.  Therefore  by  restricting  the  gradient  to  Tbb  we  can  avoid  the 
contribution  of  diagonal  matrices  in  cost  reduction.  Let’s  define  for  any  matrix  A: 

A-1  =  A  —  diag(A)  (6.7) 

We  have  this  theorem,  afterwards: 
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Figure  6.1:  Vector  fields  in  equations  (6. 4), (6. 6)  and  (6.8) 

Theorem  6.3  ;  The  Natural  gradient  flow  for  minimization  of  Ji(B)  restricted 

Q- L 

to  Tbb  is  given  by: 

B  =  -A  ±B  (6.8) 

where  A  =  ^  ( BCiBT  -  diag (BCiBT))BCiBT . 

Proof:  It  is  the  immediate  result  of  theorem  (2)  and  Lemma  (3).  A 
Figure  (6.1)  illustrates  how  the  Natural  gradient  flow  on  SL(n)  (6.6)  and  the  non- 
holonomic  flow  in  (6.8)  are  related  to  the  original  GL(n)  Natural  gradient  flow  of 
Ji(B). 

It  is  interesting  to  verify  whether  the  projected  flow  in  (6.8)  still  gives  a  descent 
flow  or  not.  That  is,  is  the  time  derivative  of  Ji{B)  negative?  We  can  see  from 
the  definition  of  gradient  with  respect  to  the  Natural  Riemannian: 

jx  =  tr((VJi  B-'fBB-1)  =  -tr(ATAx)  =  -tr(A±TAx)  =  -  A?.  <  0 


77 


So  as  long  as  A  is  not  diagonal,  (6.8)  is  a  descent  flow  which  is  a  desirable  property. 
Note  that  if  the  initial  condition  B( 0)  eSL(n),  then  flow  (6.8)  can  be  also  consid¬ 
ered  as  a  flow  over  SL(n),  hence  det  ( B )  =  1  and  ||B||2  >  1. 

6.4  Flows  on  the  Manifolds  of  Triangular  Matri¬ 
ces 

It  is  easy  to  see  that  the  groups  of  n  x  n  non-singular  lower  LL(n)  and  non¬ 
singular  upper  triangular  matrices  UL(n)  are  Lie  groups.  The  sub-group  of  lower 
triangular  matrices  with  all  diagonal  elements  equal  to  unity  SLL(?r)c  LL(n)  and 
the  sub-group  of  upper  triangular  matrices  with  all  diagonal  elements  equal  to 
unity  SUL(n)c  UL(n)  are  also  Lie  groups  of  dimension  -  ra~1^ .  In  fact  they  are 
very  simple  sub-manifolds  of  GL(n)  and  SL(n).  ll(n)  is  the  projection  of  0l(n)  over 
the  space  of  lower  triangular  matrices  and  slt(n)  is  the  projection  of  gl(n)  over  the 
space  of  lower  triangular  matrices  with  zero  diagonal.  The  Lie  Algebras  of  UL(n) 
and  SUL(n)  can  be  found  in  a  similar  way.  Note  that  we  did  not  define  SUL(n) 
as  upper  triangular  matrices  with  unity  determinant  as  we  shall  find  the  presented 
definition  more  useful. 

We  can  easily  find  the  (upper  or  lower)  triangular  counterpart  of  the  derived  flows 
for  minimizing  J\.  Flows  (6.6)  and  (6.8)  are  of  the  form  B  =  —A B  where  A  is 
defined  accordingly.  In  fact  this  is  the  form  of  any  tangent  vector  at  B.  So  by 
projecting  A  to  the  space  of  triangular  matrices  (lower  or  upper)  we  can  find  the 
triangular  (lower  or  upper)  version  of  the  derived  flows.  Let  AL  denote  a  lower 
triangular  matrix  that  has  the  same  lower  triangular  part  as  A.  Then  the  lower 
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triangular  version  of  flow  (6.8)  is 

B  =  -A ±LB,  B( 0)  €  SLL(n)  (6.9) 

with  the  same  A  has  defined  in  the  Theorem  6.3.  The  corresponding  upper  trian¬ 
gular  version  is  derived  in  the  same  manner. 

The  main  reason  for  introducing  flows  on  triangular  matrices  is  that  triangular 
versions  of  the  nonholonomic  flow  (6.8)  are  suitable  for  discretization,  in  the  sense 
that  it  is  easy  to  have  updates  of  lower  or  upper  triangular  matrices  that  produce 
unity  determinant  sequence  of  matrices,  as  we  shall  see  in  the  next  section. 

6.5  Discretization  of  the  Flows 

As  for  discretization  of  flows  over  O  (n)  in  Chapter  5,  here  also  we  have  the  problem 
of  keeping  the  updates  on  the  manifold.  For  flows  introduced  so  far  this  translates 
to  having  updates  with  unity  determinant.  In  this  section  we  develop  some  meth¬ 
ods  to  discretize  the  flows  introduced  in  the  previous  sections.  Here  we  first  give 
Euler  and  explicit  fourth  order  Runge-Kutta  discretization  of  the  flows  (see  Sec¬ 
tion  5.5).  After  that  we  will  use  the  idea  of  LU  decomposition  of  the  un-mixing 
matrix  to  derive  a  method  that  keeps  the  updates  on  the  manifold  by  construction. 
Finally  we  incorporate  the  Armjio  line  search  to  Euler  discretization. 

As  we  mentioned  all  the  flows  developed  so  far  have  either  the  form: 

B  =  -AB,  B{  0)  =  Inxn  (6.10) 

6.5.1  Euler  Discretization 

Using  the  Euler  method  for  (6.10)  we  have  the  update: 

Bk+1  =  (Inxn  -  /j,A k)Bk  k>  0  (6.11) 
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Algorithm  6.1 

1.  Set  /j  and  e. 

2.  Set  B0  =  Inxn  or  to  a  good  initial  guess. 

3.  While  ||Afc||j?  >  e  do 

d^k+l  ( ^  A I'^k) 

if  ||Bfc+i||i?  is  ubig ”  then  reduce  ji  and  goto  2. 

4.  End 


Table  6 . 1 !  A  fixed-step  size  implementation  of  the  Euler  discretization  of  the  gradient  flow  (6.10),  which 
represents  all  the  continuous  flows  developed  based  on  the  Natural  Riemannian  metric.  The  unspecified  parameters 
and  qualities  are  to  be  decided  in  practice. 


where  A^  is  computed  according  to  the  corresponding  flow  at  each  step  and  // 
is  a  small  enough  step-size  so  that  Bk+i  is  on  the  desired  manifold  (i.e  having  unity 
determinant)  and  B k  is  bounded.  Table(6.1)  shows  Algorithm  6.1  that  represents 
this  discretization.  Note  that  we  have  not  provided  any  measure  to  check  the 
determinant  of  B k,  as  it  is  very  costly.  Only  by  choosing  small  /j  we  hope  we  can 
achieve  this  goal  to  a  good  extent.  However  checking  whether  Bk  is  blowing  up  and 
the  algorithm  is  diverging  is  easy.  If  //  is  small  enough  such  that  Ji(Bk+ 1)  <  J\(Bk) 
then  we  have  a  descent  algorithm.  Therefore  for  two  reasons  of  keeping  the  update 
on  the  manifold  and  descent,  we  need  to  have  small  step-size.  We  emphasize  that 
in  an  optimization  problem  on  a  vector  space  only  the  second  restriction  appears 
and  as  long  as  we  have  a  descent  we  can  choose  large  step  sizes. 

6.5.2  Fourth  Order  Runge-Kutta  Discretization 

In  this  section  we  proceed  much  the  same  way  as  Section  5.5  to  derive  RK  dis¬ 
cretization  for  the  developed  flows.  Considering  equation  (6.10)  we  expect  that  the 
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update  equation  for  Bk  in  an  RK  scheme  will  be  in  the  form  of  Bk+ 1  =  {I+ii^k)Bk 
where  the  matrix  is  computed  using  stage  values  in  the  RK  method.  Based  on 
the  ideas  used  in  developing  our  flows  we  may  want  to  impose  some  conditions  on 
'Iq.  such  that  the  discrete  scheme  more  resembles  the  continuous  one.  In  particular 
for  a  nonholonomic  flow  we  can  require  dT  to  have  zero  diagonal  and  for  a  gradient 
flow  on  SL(n)  we  project  dK  to  Note  that  will  not  have  these  properties 
immediately  contrary  to  the  matrix  in  the  Euler  method.  Table  (6.2)  shows  an 
algorithm  based  on  explicit  fourth  order  RK  method  that  takes  this  measure  in  to 
account.  The  underlying  flow  can  be  on  SL(?r),  nonholonomic  or  flow  on  triangular 
matrices  manifold.  Note  that  for  flows  over  SLL(n)  and  SUL(n)  we  do  not  require 
to  alter  dq.,  because  it  will  have  the  required  properties  by  construction. 

6.5.3  An  iterative  algorithm  based  on  LU  factorization 

Here  we  introduce  an  iterative  algorithm  based  on  LU  [Golub]  factorization  of  the 
un-mixing  matrix  B  and  the  nonholonomic  flow  introduced  before.  The  idea  is 
to  have  alternating  nonholonomic  flows  over  SLL(n)  and  SUL(?r).  Consider  the 
nonholonomic  flow  when  B{t)  is  confined  to  be  in  SLL(?r),  i.e  (6.9).  Note  that 
both  the  Euler  and  RK  discretization  will  have  the  form  Bk+i  =  (/  +  fj>k&^L)Bk, 
where  L  is  a  matrix  whose  non-zero  elements  are  below  the  diagonal.  Therefore 
det  Bk+ 1  =  det  Bk  and  if  B0  =  Inxn  then  det  Bk  =  1  for  all  k  by  construction, 
moreover  all  the  diagonal  elements  of  Bk  are  one.  The  same  holds  if  B  is  confined 
to  be  in  SUL(n)  and  B0  =  Inxn.  By  combining  these  two  approaches  we  can  have 
an  iterative  algorithm  that  alternatively  searches  for  upper  and  lower  triangular 
factors  of  the  un-nrixing  matrix  and  keeps  the  determinant  unity  by  construction. 
An  algorithm  for  this  is  presented  in  Table  (6.3). 
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Algorithm  6.2 

Consider  any  one  of  the  flows  (6.6),  (6.8)  or  (6.9).  Let  them  be  written  in 
the  general  form  B  =  AgB  where  is  found  for  each  of  them  accordingly. 

1.  Set  /i  and  e. 

2.  Set  B0  =  Inxn  or  to  a  good  initial  guess. 

3.  While  ||Afc||f  >  e  do 

Li  =  Bkl  Y2  =  (I  +  |  At  A  Yi)Bk  =  &Y\Bk 

I3  =  (/  +  |a tAy^yJ-Bfe  =  &y2Bk,  Y4  =  (I  +  fiA Y3&Y2)Bk  =  &Y3Bk 
|  A  Yy  +  \Ay2(l>Yl  +  |  A  Y3®Y2  +  |  A  y4$y3 
If  discretizing  (6.6)  then  <—  4/° 

If  discretizing  (6.8)  then  <—  'L?Jr 

Bk+i  =  (I  +  l^k)Bk 

if  ||5fc+i||i?  is  “ big ”  then  reduce  n  and  goto  2. 

4.  End 

Table  6.2:  A  fixed-step  size  implementation  of  the  fourth  order  RK  discretization  of  the  flow  (6.10),  which 
can  represent  all  the  continuous  flows  developed.  The  unspecified  parameters  and  qualities  are  to  be  decided  in 
practice. 


The  most  interesting  feature  of  this  algorithm  is  that  det(Lfc)  =  det(C4)  =  1  for 
all  k  independent  of  /Jk  and  hence  det(L>)  =  1  by  construction.  This  is  a  nice 
property,  because  it  guarantees  that  B  is  non-singular  and  \\B\\2  >  1. 

6.5.4  Incorporation  of  the  Armijo  line  search  method 

The  Armijo  line  search  method  is  a  well  known  line  search  method  to  find  step  size 
for  descent  methods  [Tits].  We  can  incorporate  this  method  in  the  above  algorithm 
(when  the  Euler  method  is  used  for  discretization),  knowing  the  fact  that  using 
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Algorithm  6.3 

Consider  the  set  {C'i}A^1  of  symmetric  matrices.  Let  (a):t7  =  —A±uU  and  (b): 

L  =  —A ±LL  with  17(0)  =  L(0)  =  I  be  the  corresponding  upper  and  lower 
nonholonomic  joint  diagonalization  flows.  (See  equations  (6.8)  and  (6.9)) 

1.  Use  Algorithm  6.1  or  6.2  to  find  U  the  solution  to  (a). 

2.  Set  Ci  <-  UCiUT. 

3.  Use  Algorithm  5.1  or  5.2  to  find  L  the  solution  to  (b) 

4.  Set  Ci  <-  LCiLT. 

5.  Set  B  ^  L  UB 

6.  If  \\LU  —  I\\f  is  “small”  stop  else  goto  1 

Table  6.3:  The  pseudo  code  for  an  algorithm  for  joint  diagonalization  of  { Cj  } 7  p  based  on  the  LU  factorization 
of  the  un-mixing  matrix.  The  nonholonomic  flows  corresponding  to  L  and  U  are  discretized  using  Algorithm  5.1 
or  5.2.  The  algorithm  alternatively  finds  U  and  L  and  forms  B  from  them  sequentially.  The  main  feature  of  this 
algorithm  is  that  det  B  =  1.  This  is  a  nice  property,  because  it  guarantees  that  B  is  non-singular  and  ||S||2  >  1. 
The  unspecified  parameters  and  qualities  are  to  be  decided  in  practice. 


the  above  algorithm  the  updates  are  always  on  SL(n)  independent  of  step-size. 
The  basic  idea  in  the  Armjio  method  is  to  change  the  step-size  fik  =  ft  where 
(3  €  (0, 1)  by  changing  j  and  keeping  f3  constant  to  find  the  smallest  j  such  that 
Ji(Bk+i)  —  Ji(Bk)  <  aft  J\{Bk)  where  a  €  (0, 1)  and  Bk+1  is  updated  according  to 
any  descent  direction.  Note  that  J\{Bk)  is  the  directional  derivative  in  the  descent 
direction.  The  derived  algorithm  is  coded  in  Table  (6.4). 
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Algorithm  6.4 

Consider  the  set  {C;}^  of  symmetric  matrices. 

1.  Set  B  =  Inxn ,  sef  ol  €  (0, 1). 

2.  U\  =  Inxn  do  until  “convergence”: 
find  the  smallest  nonnegative  integer  j 

such  that  Ji(C4+i)  —  Ji(Uk)  <  afttr(Aj;UT  Ajft)  where:  £4+i  =  (/  —  ftA^u)Uk 
and  (t4  C*  U k  —  diag(C4  Cj  Uk))Uk  CiUk  and  increment  k. 

3.  Let  U  be  the  result  of  the  previous  step,  set  C,t  <—  UCiUT . 

4.  L1  =  Inxn  do  until  “convergence”: 
find  the  smallest  nonnegative  integer  j 

such  that  Ji(Lfe+1)  —  Ji(Lk)  <  afttr(  AftT  Aft)  where:  Lk+1  =  (I  —  ft  A  kL)Lk 
and  Ak  =  (Lfc  C,  L £  —  diag(Lfc  C*  ift) )  Lk  CiJft  and  increment  k. 

5.  Let  L  be  the  result  of  previous  step,  set  C\  <—  LCiLT . 

6.  Set  B  «-  LUB. 

7.  If  \\LU  —  I\\f  is  “small”  stop  else  goto  2 

Table  6.4:  The  pseudo  code  for  an  algorithm  for  joint  diagonalization  of  { Cr }  jL , ,  based  on  the  LU  factorization 
of  the  un-mixing  matrix  and  Armijo  line  search.  The  nonholonomic  flows  corresponding  to  L  and  U  are  discretized 
using  the  Euler  method.  By  construction  det (Uk)  =  det(Lfc)  =  1,  so  the  Armijo  method  can  be  used  without  any 
restriction  on  for  keeping  updates  on  SL(n).  The  algorithm  alternatively  finds  U  and  L  and  forms  B  from  them 
sequentially.  The  main  feature  of  this  algorithm  is  that  det  B  =  1.  This  is  a  nice  property,  because  it  guarantees 
that  B  is  non-singular  and  \\B\\2  >  1.  The  unspecified  parameters  and  qualities  are  to  be  decided  in  practice. 
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Chapter  7 


ICA/BSS  Algorithms  Based  on 
Joint  Diagonalization 


In  this  chapter  we  develop  ICA/BSS  algorithms  based  on  joint  diagonalization 
methods  developed  before.  We  introduce  hybrid  methods  that  whiten  the  data, 
however  do  not  restrict  the  subsequent  search  space  to  orthogonal  matrices. 

7.1  Introduction 

In  Chapter  5  we  gave  a  gradient  based  version  of  the  JADE  algorithm  for  BSS/ICA. 
In  this  chapter  we  will  develop  more  general  gradient  based  algorithms  for  BSS/ICA 
using  methods  developed  in  the  previous  chapters.  These  algorithms  do  not  restrict 
the  search  space  to  orthogonal  matrices.  That  is,  they  search  for  non-singular  joint 
diagonalizer  for  a  set  of  cumulant  matrix  slices. 

Consider  the  ICA  model: 

x  =  As  +  n  =  z  +  n  (7.1) 

with  standard  assumptions,  especially  with  n  being  Gaussian  noise.  In  lack  of 
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information  about  noise  or  z’s  covariance  matrix  we  choose  to  whiten  the  data 
based  on  the  (estimated)  covariance  matrix  of  x.  Suppose  W  is  a  matrix  that 
whitens  or  shperes  x,  then  for  the  white  signal  y  we  have: 

y  =  VFx  =  WAs  +  Wn  =  His  +  rf)  (7.2) 

This  can  be  considered  as  a  new  problem  with  the  same  structure  as  before  where 
A\  =  W A  is  non-singular  and  hi  is  again  a  Gaussian  noise.  Obviously  the  form 
of  the  problem  has  not  changed  and  we  have  not  lost  any  information.  Yet  this 
new  data  is  closer  to  independence  “in  most  cases”  (see  Section  2.4.4).  This  is 
the  traditional  uncorrelating  step  used  in  different  contexts,  but  obviously  we  shall 
proceed  further!. 

Note  that  when  the  noise  is  not  very  strong  we  expect  the  reduced  un-mixing 
matrix  namely  A1  =  W A  to  be  an  almost  orthogonal  matrix.  This  is  an  important 
fact  because  in  practice  it  makes  the  subsequent  nonorthogonal  joint  diagonaliza- 
tion  much  easier.  We  elaborate  more  on  this.  As  a  special  case  consider  the  case 
where  the  covariance  matrix  of  n  in  (7.1)  has  the  form  Rnn  =  o2Inxn.  Then  it  is 
not  difficult  to  show  that  (with  the  convention  that  Rss  =  Inxn)  for  the  2-norm  of 
orthogonality  error  we  have  that  (in  the  first  order  of  approximation): 

WA^T  -  InXnh  <  a(—)2  (7.3) 

^ min 

where  vmin  is  the  smallest  singular  value  of  A  and  a  is  a  positive  constant.  This 
means  that  if  the  noise  power  is  not  strong  compared  to  the  smallest  singular 
value  of  A  the  matrix  Ai  =  W A  in  (7.2)  is  close  to  orthogonal.  However,  if  we 
assume  the  orthogonality  and  impose  that  on  the  subsequent  joint  diagonalization 
phase,  obviously  we  are  reducing  the  degree’s  freedom  in  the  search  and  abandon 
further  possible  joint  diagonalization.  Hence  we  should  perform  non-orthogonal 
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joint  diagonalization.  Note  that  an  orthogonal  flow  has  the  form  0  =  A0  where 
A  is  skew-symmetric,  so  we  have  only  degrees  of  freedom  in  updating  0 

whereas  the  same  scheme  for  a  nonholonomic  flow  can  have  n(n  —  1)  degrees  of 
freedom.  This  has  the  advantage  that  makes  more  minimization  possible  but  on 
the  other  hand  the  compactness  of  O(n)  is  lost.  As  we  mentioned  before  and  as 
actual  simulations  show  the  JD  methods  developed  in  Chapter  6  work  much  better 
in  the  case  the  sought  matrix  is  close  to  orthogonal.  To  motivate  this  we  recall  that 
if  the  initial  condition  for  in  the  JD  algorithm  is  the  identity  and  the  sought  matrix 
is  close  to  orthogonal  the  required  change  in  the  norm  is  not  significant.  Note  that 
this  can  be  improved  if  we  initialize  the  JD  algorithm  with  a  better  guess,  which 
can  be  for  example  the  eigen  matrix  for  one  of  cumulant  slices.  However  we  will 
not  choose  this  method  in  our  simulations. 

Considering  (7.2)  and  using  the  properties  of  cumulants  and  cumulant  slices  we 
have  that  Cumy(:,  :,i,j )  =  A-\  where  A y  is  a  diagonal  matrix  from  Lemma 

2.1  (see  also  equation  (3.2)).  Now  we  can  search  for  a  non-singular  matrix  B  that 
jointly  diagonalizes  {Cumy(:,:,i,j)}i,j  using  methods  developed  in  Chapter  6  for 
non-orthogonal  JD.  To  be  rigorous  we  mention  that  this  deduction  was  based  on 
the  assumption  that  the  exact  correlations  and  cumulants  of  the  data  are  avail¬ 
able.  Mainly  the  assumptions  that  Cumni(:,:,i,j)  =  0  and  that  Cumy(:,:,i,j) 
are  diagonalizable  by  Ai  are  based  on  using  exact  cumulants.  Certainly  deviation 
from  these  assumptions  in  small  sample  sizes  will  affect  the  performance  of  the 
algorithms. 

There  are  other  benefits  to  whitening  the  data,  for  example  the  whitened  signal 
usually  is  such  that  its  cumulant  matrix  slices  have  the  same  dynamic  range  and 
again  this  helps  the  numerical  stability.  We  should  also  mention  that  covariance  is 
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the  most  reliable  statistics  (compared  to  higher  order  statistics),  in  the  sense  that 
it  has  the  least  amount  error  estimation.  Therefore  it  is  convincing  to  use  as  much 
information  as  we  can  from  covariance.  We  mention  that  there  are  methods  that 
rely  only  cumulants  (see  for  example  [De  Lathauwer2] ) ,  in  this  view  these  methods 
are  missing  the  nice  properties  of  second  order  statistics. 

In  summary,  in  order  to  do  ICA,  we  try  to  reduce  the  mutual  information 
“globally  or  coarsely”  at  the  first  step  by  whitening  the  data  and  then  by  joint  di- 
agonalization  of  cumulant  slices  we  reduce  the  mutual  information  further  “locally 
or  finely” .  This  way  we  use  the  benefits  of  white  data,  higher  order  statistics  and 
the  group  structure  of  both  orthogonal  and  non-singular  matrices. 

7.2  A  Class  of  ICA  Algorithms  Based  on  Non- 
Orthogonal  JD 

Here  we  introduce  a  general  scheme  for  our  algorithms.  Assume  that  T  realizations 
of  data  in  (7.1)  with  standard  assumptions  are  available.  We  place  these  realiza¬ 
tions  column- wise  in  an  n  x  T  matrix  Xt-  So  we  can  also  write  Xt  =  ASt  +  Nt- 

The  Whitening  step  WHT  consists  of  computing  Rxx  =  ^XXT  and  then  W  = 
1 

RXx  to  find  y  =  Wx  or  YT  =  WXT.  In  practice  we  always  use  vectors  after  we 
have  made  their  mean  zero.  The  cumulant  computing  step  CUM  is  simply  com¬ 
puting  the  fourth  order  cumulants  of  Yt  using  sample  estimates.  We  can  take  any 
subset  of  the  computed  cumulant  slices.  The  number  of  cumulant  slices  is  n2,  so 
for  large  n  using  all  slices  maybe  prohibitive.  Let  {Ci\R=1  be  the  subset  chosen. 
To  proceed  further  we  adopt  the  idea  of  sequentially  minimizing  the  cost  func¬ 


tion: 


N 

Ji(B,  {Cijti)  =  E  II BCiBT  -  diag(JBC'iJBT)||^ 

i= 1 

in  a  multiplicative  fashion.  That  is,  first  we  find  an  orthogonal  0  to  minimize 
Ji  (we  call  this  step  JDO)  and  we  recompute  (7*  <—  0(7j0r  (we  call  this  step 
RCO)  and  then  again  minimize  J\  (this  time  with  a  new  of  course)  by 

a  non-orthogonal  matrix  BJDN  (step  JDN).  Thereafter  B  <—  BJDNQB.  We  can 
proceed  further  and  again  recompute  (7*  <—  BJDNCiBjDN  (step  RCN)  and  repeat 
the  previous  steps.  The  reason  for  orthogonal  joint  diagonalization  is  two-fold. 
First  as  it  was  mentioned,  after  whitening  we  expect  the  un-mixing  matrix  to 
be  close  to  an  orthogonal  one;  second,  because  of  compactness  of  O(n)  and  the 
theoretical  assurance  of  convergence  of  the  gradient  flows  on  O(n)  (presented  in 
Section  5.2)  we  may  want  to  perform  orthogonal  JD.  Note  that  the  SL(n)  flow 
developed  in  Section  6.2  is  a  gradient  flow  on  a  non-compact  set,  and  also  the 
nonholonomic  flows  developed  in  Section  6.3  are  not  flows  on  a  compact  set.  Notice 
thatsteps  JDO-RCO-JDN  together  are  equivalent  to  JDN  when  it  is  initialized  to  the 
initial  condition  Bjdno  =  ©•  This  is  obviously  the  result  of  the  group  structure  of 
the  problem.  Another  point  to  be  mentioned  is  that  for  all  the  algorithms  developed 
in  the  previous  chapters  both  the  initial  conditions  and  answers  are  matrices  with 
unity  determinant,  so  an  orthogonal  matrix  can  be  used  as  the  initial  condition 
for  them.  Of  course,  despite  this  we  can  avoid  JDO-RCO  and  just  do  JDN  with 
BjDNO  A  x  n  • 

Another  approach  maybe  to  update  the  data  after  any  of  the  JD  procedures  and 
then  recompute  the  cumulants  from  the  data  samples,  this  way  flow  the  data  along 
with  the  un-mixing  matrix.  While  this  approach  seems  plausible  it  is  much  more 
costly  and  in  practice  does  not  offer  performance  increase,  as  it  was  experienced. 
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Table  (1)  summarizes  the  steps  mentioned  above  and  possible  algorithms  that 
can  be  used  in  each  of  them.  Note  that  in  step  RCO  we  have  included  the  original 
Jacobi  based  Algorithm  3.1  for  orthogonal  JD,  and  also  in  both  RCO  and  JDN  we 
can  use  any  other  ODE  solver,  such  as  the  popular  MAT  LAB®'  s  “ode45”  routine. 


WHT:  Whiten  the  data,  let  B  =  W  be  the  whitening  matrix  and  y  the  whitened 
data. 

CUM:  Compute  C  =  {C'i}^=1  a  subset  of  forth  order  Cumulant  matrix  slices  for  y. 

JDO:  Jointly  Diagonalize  C  by  an  Orthogonal  matrix 

using  any  of  Algorithms  3.1,  5.1,  5.2  or  any  other  ODE  solver. 

RCO:  Re-Compute  Cj  <—  O Ci<dT  and  B  <—  QB ,  where  O  is  the  (Orthogonal) 
answer  to  step  JDO. 

JDN:  Jointly  Diagonalize  C  by  a  Non-orthogonal  matrix  using  any  of  Algorithms 
6. 1,2, 3, 4  or  any  other  ODE  solver. 

RCN:  Re-Compute  C*  <—  BJDNCiBTjDN  and  B  <—  BJDNB,  where  BJDN 
is  the  (Non-orthogonal)  answer  to  step  JDN. 

Table  T.l:  The  core  steps  used  in  ICA/BSS  algorithms  for  model  (7.1). 

The  usual  JADE,  EG-JADE  or  RKG-JADE  algorithms  can  be  coded  as: 

1.  do  WHT 

2.  do  CUM 

3.  do  JDO  and  RCO 

4.  If  not  satisfactory  go  to  step  3 

We  can  generalize  this  to  a  class  of  algorithms  in  the  form: 
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1.  do  WHT 


2.  do  CUM 

3.  do  (JDO  and  RCO)  and  do  (JDN  and  RCN) 

3’.  do  (JDN  and  RCN) 

4.  If  “not-satisfactory”  goto  3  or  3’  else  END 

Table  7.2:  The  general  scheme  for  a  class  of  ICA/BSS  algorithms  based  on  the  JD  criterion. 

Note  that  step  3  has  two  versions.  In  the  first  version  an  orthogonal  and  a  unity 
determinant  matrix  are  sought  and  in  the  second  version  only  a  unity  determinant 
matrix  is  found.  As  we  mentioned  they  can  be  equivalent  if  we  initialize  JDN  with 
the  answer  of  JDO.  In  practice  we  refrain  from  repeating  steps  3  or  3’  because  it  is 
unlikely  to  be  fruitful  unless  the  stop  criteria  chosen  is  large  so  that  major  further 
improvement  was  possible. 

7.3  Numerical  examples 

In  this  section  we  shall  consider  applying  the  developed  ICA/BSS  algorithms  to 
some  BSS/ICA  problems.  As  it  is  evident  we  have  developed  a  class  of  algorithms 
that  can  have  lots  of  members.  For  implementation  of  each  step  we  can  have  dif¬ 
ferent  algorithms  with  different  parameters  (  for  example  ji  and  e  in  all  discretized 
gradient  or  the  number  of  iterations  in  Algorithm  6.3).  Hence  comparing  all  these 
methods  will  be  tedious  and  even  not  very  useful,  because  these  algorithms  are 
to  work  with  stochastic  data,  hence  their  performance  also  depends  on  the  source 
probability  distribution  as  well.  Therefore  we  try  to  give  few  examples  that  cap- 
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ture  different  aspect  of  the  algorithms. 


Example  1 :  Let  us  consider  x  =  Lis  +  an  where  s  is  a  n  =  5  dimensional  ran¬ 
dom  vector  for  which  the  first  element  is  exponentially  distributed  by  parameter 
A  =  1  and  mean  one,  the  second  one  has  the  two  sided  exponential  distribution  and 
the  last  three  are  uniformly  distributed  in  [— ),  |],  and  n  is  a  standard  Gaussian 
noise  and  a2  indicates  the  actual  power  of  noise  at  each  sensor.  The  mixing  matrix 
is  randomly  generated  as: 


-0.42438571607 

-1.03974356765 

0.85444609063 

0.20440694544 

-1.42107846008 

-0.9210520223 

-0.00242328109 

-1.40278316492 

-0.54986431848 

-0.12122193071 

A  = 

1.01156524988 

0.20298091664 

-0.12445806045 

0.39315798257 

0.56608725333 

-0.35265942639 

0.74153122094 

0.43137720844 

-0.84647982436 

-1.62052834709 

-0.27103430011 

0.05900889251 

0.59195584499 

1.29681290480 

0.17889884891 

We  generate  T  =  2000  independent  samples  of  x  with  a  =  0.  For  the  non- 
orthogonal  JD  step  in  this  example  we  only  consider  nonholononric  flows  (equation 
(6.8)  and  not  gradient  flow  on  SL(n)  (equation  (6.6).  In  the  first  experiment  we 
apply  steps  1,2,3  from  Table  (2)  using  JDO  and  JDN  by  the  Euler  method  (Algo¬ 
rithms  5.1  and  6.1,  respectively)  with  //  =  .01  and  e  =  .01.  Next  we  do  the  same 
by  using  fourth  order  RK  method  (Algorithms  5.2  and  6.2)  with  the  same  values 
of  /r  and  e.  After  that  we  use  the  LU  based  methods  with  both  Euler  and  RK 
methods.  We  set  e  =  .0001  and  use  the  same  /r  to  replace  the  Euler  JDN  by  an 
Euler  LU  JDN  step.  Note  that  decreasing  e  is  necessary  now  because  the  matrix 
that  its  norm  is  considered  for  stop  criteria  is  a  triangular  matrix.  We  repeat  the 
LU  steps  5  times.  Next  we  do  step  3’  only,  that  is  we  only  do  JDN  and  we  forget 
about  an  orthogonal  diagonalizer.  Table  (7.3)  shows  the  result  of  these  experi¬ 
ments.  We  have  computed  the  performance  index  Index(P)  for  P  =  BOW  A(see 
equation  (2.6)).  Note  that  when  we  do  only  JDN  then  0  =  Inxn  in  computing  P. 


92 


We  also  computed  det  B  and  ||00T  —  I\\f  as  measures  of  staying  on  the  manifolds. 
Note  that  all  methods  perform  almost  equally.  Of  course  this  can  not  be  a  exact 
way  of  comparing  them.  Next  we  repeat  the  same  experiments  by  adding  noise  to 


Index(P) 

det(.B) 

||00T  -  I\\F 

Euler  JDO/JDN 

1.0319 

0.9998 

0.0397 

RK  JDO/JDN 

1.0326 

0.9999 

0.0235 

Euler  JDO/LU-JDN 

1.0376 

1 

0.0397 

RK  JDO/LU-JDN 

1.1262 

1 

0.0235 

Euler  JDN 

1.081 

1.0065 

- 

RK  JDN 

1.1159 

1.0072 

- 

Table  7.3;  This  table  gives  the  performance  measures  from  applying  ICA/BSS  methods  schemed  in  Tables 
(1)  and  (2)  to  Example  1,  when  cr  =  0.  The  parameters  for  each  method  are  specified  in  Example  1.  All  the 
discretized  flows  are  nonholonomic  ones. 


x  with  cr  =  .1.  Table  (7.4)  shows  the  results  for  the  same  parameters  as  before. 
Obviously  Index(P)  has  increased  by  noise.  Another  point  is  that  the  required 
time  to  perform  the  computation  is  more,  which  shows  that  the  problem  becomes 
more  difficult  in  noise. 

Example  2:  In  this  example  we  again  consider  the  same  mixing  matrix  A  as  in 
Example  2  and  the  same  sources.  We  compare  the  performance  of  four  differ¬ 
ent  methods  in  noise:  steps  (1,2,3)  with  Euler  discretization,  steps  (1,2,3)  with 
RK  discretization  steps  (1,2,3)  but  this  time  using  the  MAT  LAB®  ODE  solver 
“ode45”  which  uses  a  complicated  adaptive  step  size  in  RK  method,  and  lastly 
the  JADE  algorithm.  All  flows  used  are  nonholonomic  flows.  We  use  two  values 
T  =  2000,  5000  for  sample  size.  We  use  p  =  .001  and  e  =  .01  for  all  Euler  and 
RK  methods.  Figure  (1)  shows  their  performance.  The  plots  show  that  the  de- 
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Index(P ) 

det(R) 

||00T  -  I\\F 

Euler  JDO/JDN 

1.3874 

0.9998 

0.0389 

RK  JDO/JDN 

1.3848 

0.9998 

0.0230 

Euler  JDO/LU-JDN 

1.4725 

1 

.0389 

RK  JDO/LU-JDN 

1.3872 

1 

0.0230 

Euler  JDN 

1.4140 

1.0046 

- 

RK  JDN 

1.414 

1.0072 

- 

Table  7.4:  This  table  is  the  same  as  the  previous  one,  except  that  a  =  .1. 


veloped  algorithms  outperform  JADE.  On  the  other  hand  the  experiments  show 
that  the  computation  time  for  these  methods  is  much  longer  then  JADE’s.  Hence 
the  better  performance  of  the  developed  algorithms  is  at  the  expense  of  more  com¬ 
putation  time.  Interestingly  the  fix-step  size  methods  perform  on  average  slightly 
better  than  “ode45”  but  still  at  the  expense  of  more  time.  Yet  we  should  say 
that  a  MAT  LAB®  implementation  in  M-files  does  not  give  a  good  assessment  of 
the  computation  cost  of  these  methods,  because  the  developed  algorithms  are  only 
loops  of  addition-multiplication  operations,  and  this  can  not  be  implemented  via 
MAT  LAB®  interpreter  in  an  efficient  manner.  One  advantage  of  the  developed 
JD  methods  is  that  they  only  require  additions  and  multiplications  and  no  divi¬ 
sion,  so  they  are  suitable  for  fixed  point  implementation  on  DSP  processors. 

Example  3:  In  this  example  we  compare  the  performance  of  nonholonomic 
flow  and  gradient  flow  on  SL(n)  for  ICA.  That  is  we  perform  steps  WHT,  CUM, 
JDN.  For  the  JDN  step  we  use  both  Euler  and  RK  discretizations  of  equations  (6.6) 
and  (6.8).  We  apply  these  algorithms  (Algorithm  6.1  and  6.2)  to  the  data  produced 
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The  average  performance  Index  of  4  different  ICA  methods  with  respect  to  noiseo  for  T=2000,T=5000 


Figure  7.1:  (Example  2)This  figure  shows  the  average  in-noise-performance  index  (every  point  is  averaged 
over  100  trials)  of  algorithms  of  the  form  WHT,  CUM,  JDO-RCO,  JDN  implemented  in  three  methods:  Euler,  RK, 
and  “ode45”  function  in  MAT  LAB® .  The  standard  JADE  algorithm  is  also  used.  These  algorithms  are  applied 
to  the  data  introduced  in  Example  1  in  the  presence  of  noise.  The  average  Index(P)  is  plotted  versus  <r.  We 
consider  two  sample  sizes  T  =  2000,  5000.  The  parameters  for  the  discretized  algorithms  are  set  to  fi  =  .001  and 
e=  .01. 
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The  performance  of  ICA  for  SL(n)  and  Nonholomoc  ICA  algorithm  of  the  form  WHT,  CUM,  JDN  in  noise 


Figure  7.2;  (Example  3)  This  figure  shows  the  average  in-noise-performance  index  (every  point  is  averaged 
over  100  trials)  of  algorithms  of  the  form  WHT,  CUM,  JDN  based  on  Nonholonomic  flow  on  GL(n)  and  gradient 
flow  on  SL(n),  discretized  in  both  Euler  and  RK  methods.  These  algorithms  are  applied  to  the  data  introduced 
in  Example  1  in  the  presence  of  noise.  The  average  Index(P)  is  plotted  versus  a .  We  consider  two  sample  sizes 
T  =  3500  and  all  implementation  have  the  same  parameters  n  =  .01  and  e  =  .01.  Note  that  no  orthogonal  JD 
(JDO)  is  applied. 


in  the  same  way  as  in  Example  (1).  We  set  /j,  =  e  =  .01  for  all  methods  and  we  use 
T  =  3500  samples.  Figure  (7.2)  shows  the  average  performance  index  with  respect 
to  noise  a.  From  the  figure  it  is  understood  that  SL(n)  and  nonholonomic  flows 
perform  almost  the  same  in  this  problem. 
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Chapter  8 


Summary  and  Suggestions  for 
Future  Work 


In  summary,  after  introducing  necessary  tools  and  concepts  we  developed  orthog¬ 
onal  and  non-orthogonal  flows  for  joint  diagonalization  of  a  set  of  symmetric  ma¬ 
trices.  This  derivation  was  based  on  defining  a  suitable  Riemannian  metric  that 
matches  the  group  structure  of  the  underlying  Lie  group  (i.e.  0(n)  and  GL(n)). 
In  the  case  of  orthogonal  JD,  the  cost  function  Ji(0)  in  equation  (3.3)  is  mini¬ 
mized  with  respect  to  0  £  0(n)  resulting  in  the  flow  in  equation  (5.4).  Due  to 
compactness  of  the  group  O(n)  this  cost  function  has  a  minimum  on  0(n)  and 
hence  the  gradient  minimization  of  it  is  a  legitimate  process  whereas  in  the  case 
of  non-orthogonal  JD  the  cost  function  J\{B)  with  B  eGL(?r)  does  not  have  this 
property.  J\{B)  can  be  reduced  by  diagonal  B  with  small  norm  and  it  is  not  ac¬ 
ceptable  in  the  context  of  ICA/BSS.  On  the  other  hand  as  we  showed  in  Chapter 
6  unless  all  the  symmetric  matrices  to  be  diagonalized  have  an  exact  common  di- 
agonalizer  Ji(B)  has  no  minimum  on  GL(n)  and  its  only  minimizer  is  the  trivial 
B  =  0  which  of  course  is  not  in  GL(n).  To  combat  this  (or  to  achieve  some  sort 
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of  compactification  of  GL(n)),  we  followed  the  idea  of  identifying  B  and  its  mul¬ 
tiples  to  keep  ||-B||f  large  enough  so  that  the  minimization  becomes  non-trivial. 
If  we  identify  aB,a  -  {0}  with  B  then  we  will  have  the  gradient  flow  (6.6), 
which  is  a  flow  on  SL(n).  If  we  identify  A B  where  A  is  a  non-singular  diagonal 
matrix  we  will  have  flow  (6.8)  which  is  a  nonholonomic  flow  with  respect  to  the 
Natural  Riemannian  metric  on  GL(n).  These  two  flows  were  derived  by  projecting 
(restricting)  the  gradient  of  Ji{B)  on  (to)  appropriate  spaces.  Instead  of  this  pro¬ 
jection  we  could  use  another  cost  function  introduced  in  equation  (3.7),  and  this 
can  be  one  idea  for  further  work. 

We  also  gave  local  point  convergence  proofs  for  the  orthogonal  and  SL(n)  gradi¬ 
ent  flows.  In  the  case  of  the  nonholonomic  flow  still  there  is  room  for  investigating 
convergence  properties  of  the  flows. 

Proper  discretization  of  flows  on  a  manifold  is  not  a  trivial  task.  There  are 
sophisticated  methods  to  do  this,  we  did  not  follow  this  path,  instead  we  used 
Euler  and  fourth  order  Runge-Kutta  methods  with  suitable  modification  and  small 
enough  step-size  to  have  discretization  schemes  that  stay  on  the  manifold  to  a  good 
extend.  However,  we  gave  an  iterative  discretization  scheme  for  flow  (6.8)  based  on 
the  LU  decomposition  of  B.  This  method  has  the  property  that  for  the  resultant 
B ,  det  B  =  1  regardless  of  the  step-size  used.  We  also  did  not  seek  adaptive 
step-size  selection  schemes  in  discretization.  This  can  be  a  possible,  but  yet  little 
bit  dubious  path  for  further  work.  These  gradient  based  algorithms  are  sensitive 
to  scaling  mismatches  in  data.  Proper  pre-conditioning  can  be  very  useful  in  the 
general  case,  for  as  we  mentioned  in  the  case  of  the  developed  ICA/BSS  algorithms 
the  matrices  to  be  diagonalized  (i.e.  cumulant  slices  of  a  white  signal)  are  in  general 
properly  scaled. 


The  JD  algorithms  based  on  gradient  are  very  slow,  as  other  gradient  methods 
are,  specially  if  the  level  curves  of  the  cost  functions  are  not  nice,  and  this  is  the 
case  when  the  matrices  are  far  from  having  a  joint  diagonalizer.  Therefore  a  very 
major  development  would  be  to  devise  Newton  based  methods  for  this  optimiza¬ 
tion  problem,  that  take  the  group  structure  and  the  issue  of  non-compactness  of 
GL(n)  in  to  account.  For  the  case  of  orthogonal  JD  the  so-called  JADE  algorithm 
(Algorithm  3.1)  [Cardosol]  is  a  simple  and  fast  solution. 

Based  on  the  developed  flows  we  proposed  a  class  of  ICA/BSS  algorithms  that 
whiten  the  data  first  but  do  not  confine  the  search  for  an  un-mixing  matrix  to 
O(n)  as  the  JADE  algorithm  does.  In  a  more  elegant  language  we  can  say  that  the 
developed  algorithms  solve  the  ICA  problem  in  two  steps:  first  by  a  whitening  step 
we  get  closer  to  an  independent  answer  ( globally )  and  then  by  looking  for  a  non- 
orthogonal  joint  diagonalizer  for  the  fourth  order  cunrulant  slices  of  the  whitened 
signal  we  locally  solve  the  ICA  problem.  The  celebrated  JADE  algorithm  does  the 
same  except  that  the  sought  matrix  is  confined  to  be  orthogonal.  Our  methods 
are  also  different  from  only-HOS  based  methods  because  we  use  the  second  order 
statistics  to  whiten  the  data.  Our  argument  for  doing  this  is  that  in  general  a 
signal  becomes  closer  to  independence  by  whitening  (see  Section  2.4.4  for  more 
details).  We  neither  gave  a  precise  statement  of  this  fact  nor  a  proof,  which  can  be 
a  subject  for  further  work  as  well.  It  should  be  noted  that  by  whitening  the  data  we 
will  not  lose  any  information  due  to  the  group  structure  of  the  ICA/BSS  problem. 
Moreover  whitening  has  the  advantage  that  scales  the  data  properly  for  further 
processing  (a  concept  comparable  to  pre-conditioning  in  numerical  analysis).  This 
is  also  a  vague  statement  that  can  be  made  more  rigorous  and  investigated  further. 

We  applied  the  developed  methods  to  few  cases  of  synthetic  data,  a  more  general 
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pace  would  be  to  examine  the  performance  of  these  core  ICA/BSS  methods  in 
more  sophisticated  and  specific-purpose  ICA/BSS  schemes.  The  non-orthogonal 
JD  algorithms  could  also  be  examined  in  the  context  of  non-stationary  BSS  where 
a  set  of  correlation  matrices  are  to  be  jointly  diagonalized  [Phaml] . 
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Appendix  A 


Derivations  and  Some  Proofs 


In  some  cases  we  will  use  these  easy-to-prove  identities  about  n  x  n 

A,  B,  C: 

tr  (ABC)  =  tr(CAB)  =  tv(BCA) 
tr(A  diag(L>))  =  tr(diag(/L)  B)  =  t,r(diag(yl)  diag (B)) 
and  if  A  and  B  are  symmetric  the  Lie  Bracket  is  skew-symmetric: 

[A,  B]T  =  ( AB  -  BA)t  =  BA-  AB  =  -[A,  B] 

A.l  Proof  of  Theorem  5.1 

We  name  each  single  term  in  the  summation  as  Jj(0)  that  is: 

N  N 

J\ (0)  =  E  Ji  =  E  ll0C,*o©T  -  diag(0C'o0T)||^ 

i= 1  i=  1 

then  note  that  using  the  above  identities  and  the  symmetry  of  C,o  and 
and  orthogonality  of  0  we  have: 

Ji  =  tr  ((0ao0r  -  diag^Go©7))  (0ao0T  -  diag^Qo©7)))  = 


matrices 

(A.l) 

(A.2) 

(A.3) 

(A.4) 

0Cio0T 
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tr(Cf0)  -  2tr(0C'jO0r  diag(0C'iO0T))  +  tr(diag(0C'iO0r)  diag(0C'iO0r))  = 

tr(C* )  -  tr(0C'iO0T  diag(0C'o0r))  (A.5) 

In  the  next  step  we  compute  the  time  derivative  of  Ji,  i.e.  J*.  By  using  (A.1,A.2) 


we  have: 

Ji  =  -2tr (0CiO0T  diag(0ao0T))  -  2tr(0C'o©T  diag(0CiO0T))  (A.6) 

As  it  was  mentioned  in  Chapter  4  we  can  write  0  =  0A  where  Anxn  is  a  skew- 
symmetric  matrix.  With  this  in  mind  we  continue  (A.6)  as: 

ji  =  -2tr(0A CmQt  diag(0CiO0T))  -  2tr(0CiOAT0T  diag(0C'iO0T))  =  (A.7) 
-2tr(0ACiO0T  diag(0CiO0T))  +2tr(0QoA0T  diag(0CiO©T)) 

Then  using  (A. 1), (A. 2), (A. 3)  and  orthogonality  of  0  we  have: 

Ji  =  — 2tr (0T0CiO0T  diag(0C'iO0T)©A)  +2tr(0T  diag(0CiO0T)0C'iO0T0A) 

=  2tr  f©T [diag(0CiO0T),  0CiO0T]  ©A^j  =  2tr  ^0T [diag(0CiO0T),  ©Oo©7]  ©^ 

where  [A,  B]  =  AB  —  BA  is  the  matrix  Lie  Bracket.  By  definition  of  the  Rieman- 
nian  metric  in  equation  (5.2)  and  the  definition  of  gradient  with  respect  to  this 
metric  we  have: 

VJi  =  2(0T[diag(0Cio0T),0C'iO0T])T  =  -2[diag(0CiO0T),  ©C*,©7]  ©  = 
-20[0Tdiag(0CiO0T)0,C'o] 

Accordingly: 

N  N  N 

yj  =  YJJi  =  ~ 2©^  [eTdiag(0CiO0r)0,Qo]  =  -2  [diag(0C7io0r),0Cio0T]  © 

i= 1  i=  1  i= 1 

Then  the  gradient  flow  for  minimization  of  «/(©)  will  be: 

(A. 8) 

which  proves  the  second  part  of  the  theorem. 
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A. 2  Proof  of  Theorem  5.2 


From  Cj  =  0Cjo©T  and  by  (A. 8)  we  have: 

N 

Cj  =  ecj0eT  +  ®cj0QT  =  Y 

i= 1 


[diag(0CiO0T),  eCi O0T]  OCj O0T- 


0Cio0r[diag(0C'iO0T),0C'iO0T]  =  Y 


N 


2=1  L 


[diag(0C'iO0T),  ©ao0T] ,  0CiO02 


-E 


[diagfCj),  Q],  Ci 


^2  [diag(C'j) ,  Cj\ ,  Cj 
-  2=1 


In  deriving  the  last  expression  we  used  the  definition  of  the  Lie  Bracket. 


A. 3  Proof  of  Theorem  6.1 


We  name  each  single  term  in  the  summation  as  Jj{B )  that  is: 

N  N 

Ji(B)  =  Y  MB)  =  Y  II BCiBT  -  di&g(BCiBT)\\2F  (A.9) 

2=1  2=1 

then  note  that  using  identities  (A.l),  (A. 2)  and  the  symmetry  of  Ci  s  we  have: 

Ji  =  ti((BCiBT  -diag(BCiBT))(BCiBT  -diag(BCiBT))SJ  =  tr{BCiBT  BCiBT) 
—2tr(BCiBTdia,g(BCiBT))  +  tr(diag(BCiBT)diag(BCiBT))  =  tr  (BCiBT  BCiBT) 

—tr(BCiBTdiag(BCiBT)) 

Next  we  compute  the  time  derivative  of  Ji(B)  in  terms  of  the  time  derivative  of  B 
i.e.  B  €  TsGL{n).  Using  the  symmetries  and  the  same  identities  we  will  have: 


Ji(B)  =  4tr (CiBTBCiBTB)  -  dtY{CiBTdiag{BCiBT)B)  = 

4ti  ({CiBTBCiBT  -  CiBTdiag{BCiBT))B  \  =  tr (XB)  (A.  10) 


103 


Now  from  the  definition  of  the  Natural  Riemannian  metric  in  equation  (6.2)  and 
the  corresponding  gradient  we  have  that: 

ji  =  tr{{BT  B)~l\7  Jj  B) 

hence  from  (A.  10)  we  have  that:  VJ*  =  XTBTB,  therefore: 

VJ*  =  4:{(BCtBT  -  di&g(BCiBT))BCiBT)B 

and  VJi  =  VJ*  yields: 

VJi(B)  =  4^  ( BCiBT  -  diag(BCiBT))BCiBAB 
'  1=1  ' 

which  is  the  first  statement  of  the  theorem  and  the  next  statement  is  simply  letting 
B  =  -\VJ1{B). 
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